Every now and then I like to spend some time reading articles and papers about developing and administering database information systems. Today I came across the LISA conference (Large Installation System Administration) and in particular the paper On Designing and Deploying Internet-Scale Services, where you can find a lot of important advice on how to develop and administer a big online platform, and much more. If you are interested, you can also look at the other papers presented at LISA 2007.
Google is developing its own distributed storage system, which has a really interesting structure. It is described in the paper:
“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.”
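The data model the abstract refers to is, in the paper's own words, a sparse, distributed, persistent multidimensional sorted map indexed by (row key, column key, timestamp). Here is a minimal in-memory sketch of just that lookup semantics; the class name and helpers are my own, and it says nothing about Bigtable's actual distribution or storage layers. The row and column names come from the paper's running example.

```python
class TinyTable:
    """Toy illustration of Bigtable's (row, column, timestamp) -> value map."""

    def __init__(self):
        # (row key, column key) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            # By default, return the most recent version of the cell.
            return versions[max(versions)]
        return versions.get(timestamp)


t = TinyTable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))     # latest version
print(t.get("com.cnn.www", "contents:", 1))  # a specific older version
```

Keeping multiple timestamped versions per cell is what lets clients like the web-indexing pipeline keep several crawls of the same page side by side.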
Update: Thanks to a friend, I got to know about MapReduce, Google's model for simplified data processing on large clusters.
“MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.”
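To get a feel for the model the abstract describes, here is the classic word-count example written as a map and a reduce function. The sequential driver below is only a toy stand-in for Google's distributed runtime, which is where the parallelization, fault handling, and shuffling actually happen.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word seen.
    for word in text.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: sum all counts emitted for the same word.
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy sequential driver: map, group by key (the 'shuffle'), reduce."""
    groups = defaultdict(list)
    for doc_id, text in inputs:
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)  # "the" appears twice across the two documents
```

The appeal of the model is that the user only writes `map_fn` and `reduce_fn`; everything the driver does here is what the real runtime distributes across thousands of machines.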