Archive:Hardware and hosting report

From Wikimedia Foundation Governance Wiki
==2006 Q3 Report==
===General report===
''By Domas Mituzas'' - August 2006, <small>extract from his report on both hardware & software, originally posted to [//lists.wikimedia.org/pipermail/foundation-l/2006-August/022676.html foundation-l]</small>

One piece of good news is that we can still stay with the same class of database
servers, which are even getting much cheaper than before.
Database server cost per unit went from $15000 in June 2005, to $12500
in October 2005, to $9070 in March 2006.
We got four of these servers in March and called them... db1, db2,
db3 and db4.
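That price curve works out to roughly a 40% drop per unit over nine months. A quick check of the figures quoted above:

```python
# Per-unit database server prices quoted in the report.
prices = {"2005-06": 15000, "2005-10": 12500, "2006-03": 9070}

# Relative fall from June 2005 to March 2006.
drop = 1 - prices["2006-03"] / prices["2005-06"]
assert round(drop, 2) == 0.40  # about a 40% fall per unit

# The March batch of four servers (db1..db4) therefore cost:
assert 4 * prices["2006-03"] == 36280
```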

For the application environment we made a single $100000 purchase, which
provided us with 40 high performance servers (with two dual-core
Opteron processors and 4GB of RAM each).
This nearly doubled our CPU capacity, and also provided enough
space for revision storage, in-memory caching, etc.

For our current caching layer expansion we ordered 20 high
performance servers (8GB memory, four fast disks, $3300 each), which
should appear in production in about a month.
We're investigating possibilities for adding more hardware to the
Amsterdam cluster. We might end up with 10 additional cache servers
there too.

We also purchased $40000 worth of Foundry hardware, based on their
BigIron RX-8 platform.
We will use it as our highly available core routing layer, as well
as for connectivity for the most demanding servers.
This will also allow flexible networking with upstream providers.

Our next purchase will be image hosting/archival systems; there is
still an ongoing investigation into whether to use our previous
approach (a big cheap server with lots of big cheap disks) or to
deploy a storage appliance.

We have reallocated some aging servers to the search cluster and other
auxiliary roles, and continue this practice, so that we end up with a
more homogeneous application environment.

==2005 Q4 Report==

===General report===

''By Domas Mituzas'' - November 2005


[[Image:DevHeadquarters.jpg|left|200px]] These months were yet again amazing in Wikimedia's growth history.
Since September request rates have doubled, lots of information has been added, modified and expanded, and more users have arrived.
To deal with that, the site had to improve both its software and hardware platforms again.

Of course, more hardware was thrown at the problem. In mid-September three new database servers (thistle, ixia, lomaria) were added to the pool, removing an ancient type of hardware from service. At current data growth rates the 'old' 4GB-RAM boxes could not keep up with operation, except in quite limited roles.
40 dual-Opteron application servers have been deployed, conserving our limited colocation space as well as providing a lot of performance for the buck.
One batch of them (20) was deployed just this week.
They're equipped with larger drives and more memory, allowing us to place various unplanned services on them (9 Apache servers are storing old revisions as well), and some servers participate in a shared memory pool, running memcached.
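The shared memory pool works because every Apache box hashes a given cache key to the same pool member, so the whole cluster behaves like one big cache. A minimal sketch of that key-distribution idea (the server names and the simple modulo scheme are illustrative assumptions; real memcached clients typically use consistent hashing so that adding a server remaps fewer keys):

```python
import hashlib

# Hypothetical pool members; in practice the pool is whichever
# Apache boxes happen to be running memcached.
POOL = ["srv1:11211", "srv2:11211", "srv3:11211"]

def server_for(key, pool=POOL):
    """Pick the memcached server responsible for a key by hashing it onto the pool."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return pool[int(digest, 16) % len(pool)]

# Any client hashing the same key reaches the same server, which is
# exactly what makes a set of independent daemons act as one shared cache.
assert server_for("enwiki:pcache:12345") == server_for("enwiki:pcache:12345")
```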


One of our most efficient purchases was the $12k image server 'amane', providing us with storage space and even the ability to run backups at current loads.
It now runs a highly efficient and lightweight HTTP server - lighttpd.
So far images are being served fine, but the growth of Wikimedia Commons will force us to find a really scalable and reliable way to handle lots of media.

Additionally, 10 more application servers have been ordered, together with a new Squid cache server batch. These 10 single-Opteron boxes will have 4 small and fast disks and should enable efficient caching of content.

As all this gear was bought with donated money, we really appreciate the community's help here, thank you!

The Yahoo-supplied cluster in Seoul, Korea has finally gone into action, bringing cached content closer to Asian locations, as well as hosting master databases and an application cluster for the Japanese, Thai, Korean and Malaysian Wikipedias.

For internal load balancing Perlbal was replaced by LVS, and we've got a nice flashy donated load balancing device that may be deployed into operation soon as well. LVS has to be handled with care, and several tiny misconfiguration incidents seriously affected site performance. Lately the cluster has become quite big and complex, and we now need more sophisticated and extensive sanity checks and test cases.

There is lots of work going on to establish more failover capabilities - we will have two active links to our main ISP in Florida. The static HTML dump is (becoming) nice and usable and may help us in case of serious crashes. It can be served from the Amsterdam cluster as well!

Over the last several days we managed to bring the cluster into quite proper working shape; now it's important to fix everything and prepare for more load, more growth and yet another expansion. We hope that, with the help of the community, we will be able to solve all our performance and stability issues and avoid becoming Lohipedia :)

Lots of various problems have been solved so far to achieve what we have now, and a lot of low-hanging fruit has been picked. What we deal with now is complex, and needs manpower and fresh ideas as well.

Discussions are always welcome on #wikimedia-tech on Freenode (except during serious downtimes :) ).

And, of course, thanks Team (or rather, Family)! It is amazing to work together!

Cheers,
Domas


===Links to last purchases===

*http://meta.wikimedia.org/wiki/Hardware_ordered_September_14%2C_2005
*http://meta.wikimedia.org/wiki/Hardware_ordered_October_6%2C_2005
*http://meta.wikimedia.org/wiki/Hardware_ordered_October_18%2C_2005
*http://meta.wikimedia.org/wiki/Hardware_ordered_November_15%2C_2005

==2005 Q3 Report==


===General report===
''By Domas Mituzas'' - September 2005

Already in March it was clear that we needed more hardware to solve our main performance bottlenecks, but there was a lot of hesitation about what to buy. This mostly ended in mid-April, when we ordered 20 new application server (Apache) boxes, which were deployed in May. Then again, our main performance bottleneck happened to be our database environment, which was resolved by ordering and deploying two shiny new dual-Opteron boxes with 16GB of RAM each, accompanied by an external Just a Bunch of Disks (JBOD) enclosure. In this configuration we eliminated the previous bottlenecks, as disk performance and in-memory caches were the critical points. These two boxes have already shown themselves capable of handling 5000 queries per second each without breaking a sweat, and they were of great aid during content rebuilds for the MediaWiki 1.5 upgrade (we could run the live site without any significant performance issues).

A lot of burden was removed from the databases by using more efficient code, disabling really slow functions and, notably, deploying the new Lucene search. Lucene can run on cheap Apache boxes instead of our jumbo (well, not really that jumbo on an enterprise scale) DBs, so we have been able to scale up quite a lot since December with the same old, poor boxes. Archives (article history) were also placed on cheap Apache boxes, freeing expensive space on the database servers. Image server overloads were temporarily resolved by distributing content to several servers, but a more modern content storage system is surely required and planned.

There were several downtimes related to colocation facility power and network issues, of which the longest was during our move (on wheels!) to a new facility, where we have more light, power, space and fresh air. Anyway, acute withdrawals were cured by working wikis.

There was some impressive development outside Florida as well. A new datacenter in Amsterdam, generously supplied by Kennisnet, provided us with the capability to cache content for the whole of Europe and neighboring regions. Moreover, it enabled us to build distributed DNS infrastructure, and preparations are being made to serve static content from there in case of emergencies. Various other distribution schemes are being researched as well.

Currently preparations are being made to deploy our content in a Korean datacenter provided by Yahoo. There we will surely use our established caching technology, but we might take one step further and put our master content servers for regional languages there. Further expansion of our existing Florida content-processing facility is also being considered.

===Multilingual error messages finally implemented!===


On the 28th of September, '''Mark Ryan''' announced that multilingual messages had now been implemented on the Wikimedia squids. Here is an incomplete list of those on IRC who helped with translations: taw, Mackowaty, WarX, SuiSui, aoineko, Submarine, Rama, Frieda, Quistnix, galwaygirl, Fenix, mnemo and avatar. Particular thanks must go to fuddlemark for extensive JavaScript help, and to Jeronim for implementing the new message across the squids. Everyone's help
has been greatly appreciated. :)


Now, we just hope not to see these messages too often...


==2005 Q2 Report==


===Wikimedia servers get a new facility===

*:On June 7, 2005 (UTC), the Wikimedia cluster was moved to another facility, as there would be more space there. The newer facility is better designed and is in the same city as the former one, Tampa, Florida. The move had to be done all at once, with all network equipment and servers turned off and moved across the street. It took nearly 11 hours from 07:00 UTC, 03:00 local time. ''<small>[[m:User:Midom|Domas Mituzas]]</small>''


===New appointments on the Foundation team===

*:On the 25th of June, Jimbo Wales announced the appointment of the following people to official positions within the [[m:Wikimedia Foundation organigram|Foundation organigram]]. In particular:
*:*Chief Technical Officer (servers and development): Brion Vibber
*:*Hardware Officer: Domas Mituzas
*:As Jimmy Wales best put it, the board encourages these people to work closely with, and even to help formulate, committees within Wikimedia. These appointed positions do not carry any special power within any of those groups, but serve as a point of contact to the Board, and to the community, to ensure that information flows between all concerned parties within their own fields of expertise. The appointment is a reflection of the work these people are already doing in these areas, and should not be seen as a disincentive to others to become involved.

==2005 Q1 Report==

===General report===
{| style="float: right; padding: 5px;"
| [[Image:LCD layers.png|150px|right|LCD layers]]
|}

January and February saw a number of slowdowns and, on one occasion, a complete shutdown of Wikimedia sites. These were due to a variety of reasons: many individual servers broke (10 machines in the main cluster were fixed in the first quarter alone), and traffic continues to rise. The colocation facility also had a massive power failure in February, leading to two days of downtime and read-only availability. As of the start of Q2, almost all servers are back in action, and the cluster is looking healthy.

Developers have recently started a [http://www.livejournal.com/community/wikitech/ LiveJournal] as a way to communicate with the community about server issues. That's one more feed for your preferred RSS aggregator.

====Power outages====
There were two major power outages in the first quarter. The first outage, around February 21st, was due to a [[m:February_2005_server_crash|double power failure]]: two different power supplies to our cluster were switched off at the same time, when some of the internal switches in our colocation facility failed. Some databases were corrupted by the sudden loss of power; the surviving database had not been completely up-to-date with the most current server, and it took almost two days for developers to recover all data. In the meantime, the site was restored to read-only mode after a few hours.

The second outage took place on March 16th due to a human error: one of the master database's hard disks filled up, preventing slaves from being updated. At this point the data cluster had not fully recovered from the previous outage, and there was less than full redundancy among the database slaves. By the time space was made on the disk, the most up-to-date slave was already many hours behind. It took over eight hours of read-only time for the databases to be resynchronized.
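The failure mode described above (a stalled master starves the slaves, and reads then drift stale) is the kind of condition lag-aware load balancing guards against: check each replica's lag before routing queries to it, and refuse writes when nothing is fresh. A small illustrative sketch; the threshold and function names are assumptions for the example, not the actual Wikimedia configuration:

```python
# Illustrative only: choose a replica to read from based on replication lag.
# The 30-second threshold is an assumed value, not a real production setting.
MAX_LAG_SECONDS = 30

def pick_replica(lags):
    """Return the least-lagged replica under the threshold, or None.

    `lags` maps replica host name -> seconds behind the master.
    Returning None signals that every replica is stale, so the site
    should fall back to read-only service rather than serve old data.
    """
    ok = {host: lag for host, lag in lags.items() if lag <= MAX_LAG_SECONDS}
    if not ok:
        return None
    return min(ok, key=ok.get)

# One healthy replica and one many hours behind (as after the disk-full incident):
assert pick_replica({"db2": 4.0, "db3": 9 * 3600}) == "db2"
# Everything stale: time to go read-only.
assert pick_replica({"db2": 7200.0, "db3": 9 * 3600}) is None
```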

====Caches installed near Paris====
<small>''Report from [[:fr:User:David.Monniaux|David Monniaux]].''</small>

[[Image:Paris_servers_DSC00190.jpg|200px|center|Our servers are the three machines in the middle.]]

In December 2004, servers donated to the Wikimedia Foundation were installed at the [http://www.telecity.ie/france.htm Telecity facility] located in [[:en:Aubervilliers|Aubervilliers]] on the outskirts of [[:en:Paris|Paris]], [[:en:France|France]]. The network access is donated by French provider [http://www.lost-oasis.fr Lost Oasis]. In January, the software setup was completed; however, various problems then had to be ironed out.

As of April 1, 2005, those machines cache content in English and French, as well as all multimedia content (images, sounds...), for users located in [[:en:Belgium|Belgium]], France, [[:en:Germany|Germany]], [[:en:Luxembourg|Luxembourg]], [[:en:Switzerland|Switzerland]], and the [[:en:United Kingdom|United Kingdom]] ([http://bleuenn.wikimedia.org:8080/country-stats/ daily stats per country]). The caches work as follows: if they hold the requested page in their local memory, they serve it directly; otherwise, they forward the request to the main Florida servers, and memorize the answer while passing it to the browser of the Wikipedia user. Typically, for text content, 80% of accesses are cached (that is, served directly); the proportion climbs to 90-95% for image accesses. Due to the current way the MediaWiki software works, content is cached much more efficiently for anonymous users: essentially, all text pages have to be requested from Florida for logged-in users.
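The cache-then-forward behaviour just described can be sketched as a tiny front-end. This is a simplification of what the real caching software does, and the logged-in bypass is modelled crudely, but it shows where the hit-rate figures come from:

```python
# Minimal sketch of the cache-then-forward behaviour described above.
class EdgeCache:
    def __init__(self, fetch_from_origin):
        self.store = {}                  # pages already memorised locally
        self.fetch = fetch_from_origin   # call to the main Florida servers
        self.hits = self.misses = 0

    def get(self, url, logged_in=False):
        # Logged-in text requests bypass the cache (pages are personalised),
        # which is why anonymous users are cached far more effectively.
        if not logged_in and url in self.store:
            self.hits += 1
            return self.store[url]
        self.misses += 1
        body = self.fetch(url)           # forward the request to the origin
        if not logged_in:
            self.store[url] = body       # memorise the answer for next time
        return body

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = EdgeCache(lambda url: f"<page for {url}>")
for _ in range(10):                      # ten anonymous requests for one page
    cache.get("/wiki/Paris")
assert cache.hit_rate() == 0.9           # 1 miss then 9 hits: a 90% hit rate
```

The 80% text figure falls out of the same mechanics once logged-in users (always misses here) are mixed into the request stream.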

The benefit of such caches is twofold:
* First, they relieve the load on the main Wikimedia Florida servers. We have to buy our bandwidth (network capacity) for Florida, whereas we can get (smaller) bandwidth chunks in other locations.
* Second, they make browsing much quicker and more responsive, at least for anonymous users. Any access to the Florida servers from Europe may take 100-150&nbsp;ms round trip; this means that retrieving a complete page may take a significant fraction of a second, even if the servers respond instantaneously. The Paris servers, on the other hand, have much smaller round-trip times from the countries they serve.
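The latency argument above is easy to make concrete: even with instantaneous servers, each round trip costs 100-150&nbsp;ms from Europe to Florida, and a page load needs several round trips in sequence. A back-of-envelope calculation (the round-trip counts are assumed for illustration):

```python
# Back-of-envelope: pure network time for a page load, ignoring server time.
def page_network_time(rtt_ms, sequential_round_trips):
    """Total latency in seconds when fetches happen one after another."""
    return rtt_ms * sequential_round_trips / 1000.0

# Europe to Florida: 125 ms RTT, say 4 sequential round trips
# (connection setup, the page itself, then dependent requests).
florida = page_network_time(125, 4)
# From a nearby Paris cache: perhaps 15 ms RTT for the same exchange.
paris = page_network_time(15, 4)

assert florida == 0.5  # half a second of pure latency: "a significant fraction of a second"
assert paris < 0.1     # roughly an order of magnitude better from the local cache
```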

The Paris caches serve as a production experiment and test bed for future cache developments, which are currently being studied. We may, for instance, change the caching software in order to reduce the load on the caches (currently, with all the countries they serve, the machines are [http://bleuenn.wikimedia.org:8080/ganglia/ loaded] at 80-95%; the machines are, however, quite outdated), and see how we may improve efficiency and cache hit rates (it appears that the caches do not fetch data from each other as efficiently as they should).

====Blocking of open proxies====
Since March 28th, Wikipedia has been automatically blocking edits coming from open proxies. The feature is still in testing; details are being worked out
[http://meta.wikimedia.org/wiki/Proxy_blocking on the Meta-wiki].

====Jimmy Wales asks for more developers at FOSDEM====
As the opening speaker at the FOSDEM 2005 conference in Brussels, Jimmy Wales appealed to the development community for support with the technical side of running Wikipedia. Analyses of these remarks were published in [[w:en:Wikipedia:Wikipedia Signpost/2005-03-07/FOSDEM|several places]] last week.


===Announcements===

*'''Wikimedia server was down for several hours on 17 March.'''
*:On 17 March there was read-only service for several hours after a disk drive used for logging on the master database server became full. Monitoring tools showed the combined space for all drives, and the last human check of that drive alone had shown apparently sufficient space available. Improved monitoring and larger disk drives are being obtained. ''<small>James Day</small>'' <br>
*:[[w:en:Wikipedia:Wikipedia Signpost/2005-03-21/Database disk space|''More downtime analysis.'']]

*'''Wikimedia servers down on February 22.'''
*:On February 22, two circuit breakers blew, removing power from most of the Wikimedia servers and leading to the loss of all service for several hours, no editing for most of a day and slowness for a week. Full recovery of database robustness took the better part of a month. Uninterruptible power supplies could have reduced the effect of this incident (but not all power incidents; law requires an emergency power-off switch, which has caused outages for other sites, most notably for LiveJournal not long ago). Additional UPS systems will be used for key systems, fire code willing. <small>-''[[meta:User:JamesDay|James Day]]''</small>


==Links==

*http://wikitech.wikimedia.org/view/Server_admin_log
*[[m:servers]]: updated list of our servers
*[[m:Developer]]: list of developers involved in software and hardware management.
*[http://www.livejournal.com/community/wikitech/ LiveJournal]: a way to communicate with the community about server issues. That's one more feed for your preferred RSS aggregator.


==Archives==
*[[/Archives 2005]]
*[[/Archives 2004]]

[[Category:English]]
{{Category:Hardware report}}

Revision as of 15:45, 25 March 2013
