One of the many challenges in architecting “Big Data” solutions is simply moving the stuff around.
Sometimes it makes sense to move the work to the Data (instead of the other way around). This assumes that your Data is already in a form that is amenable to heavy-duty analytics.
More typically, you’re looking at many terabytes, or maybe even petabytes of Data tied up in files on disk.
At these scales, distributed systems are a necessity, not a luxury.
For example, Google™ had to invent their own file system to reliably handle the Data volumes and workloads that indexing the entire World Wide Web entails. The Google File System (GFS) provided extremely high capacity, high performance, and high reliability on low-cost (and failure-prone) commodity hardware.
GFS ran across hundreds of locally and geographically distributed server clusters, some consisting of thousands of servers and aggregating many petabytes of storage. It reliably delivered an aggregate read-write throughput of 40 gigabytes per second, even in the face of frequent and routine disk and server hardware failures.
I describe GFS in the past tense only because it has in recent years been replaced by “Colossus”, Google’s next-generation distributed file system. Think of Colossus as a more reliable, better performing, more cost-effective, and more flexible version of GFS.
Google has developed a number of Big Data processing tools and solutions that run on top of GFS / Colossus.
For example, Google developed the “MapReduce” programming model on GFS. After Google published a paper describing the MapReduce concept, Doug Cutting and Mike Cafarella built an open-source implementation, which grew into “Hadoop” with heavy investment from Yahoo!; the associated distributed file system became known as “HDFS”. By the way, Google has since moved away from MapReduce, replacing it with “Cloud Dataflow”, a managed service built on Colossus and on other advanced technologies such as “FlumeJava” (a framework for easily building Data-parallel pipelines) and “MillWheel” (reliable, high-volume Data stream processing).
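To make the MapReduce idea concrete, here is a minimal single-machine sketch of the classic word-count example. It is an illustration only, not Google's or Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, and everything runs in one process, whereas a real system would spread the map and reduce steps across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (in-process stand-in for a cluster)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big files"]))
```

The point of the model is that the map and reduce functions are independent per document and per key, which is what lets a framework run them in parallel next to wherever the Data happens to live.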
Google has deployed many other significant Big Data management services on top of GFS / Colossus, such as “Bigtable” (highly scalable NoSQL Database), “BigQuery” (high performance queries against massive, static Data collections), “Pregel” (large scale graph processing), “Spanner” (globally distributed NewSQL Database), and more to come, I am sure.
So by standing on the shoulders of giants like Google (and certainly Microsoft, Yahoo! and other companies that are heavily invested in Big Data management solutions), we can see that the solution to moving around huge volumes of Data is not to move it around! Instead, distribute it, and then distribute the query and analytic tools that we need to extract value from that Data.
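As a rough sketch of that principle, the Python below spreads records across shards once, then sends a small computation to each shard and combines only the tiny partial results. All of the names here (`partition`, `local_summary`, `distributed_mean`) are hypothetical, and threads stand in for what would be separate servers in a real deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_shards):
    """Distribute records round-robin across shards (stand-ins for servers)."""
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        shards[index % num_shards].append(record)
    return shards


def local_summary(shard):
    """Runs 'next to' the Data: returns only a small (sum, count) pair."""
    return sum(shard), len(shard)


def distributed_mean(shards):
    """Ship the computation to every shard, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_summary, shards))
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count if count else None


if __name__ == "__main__":
    shards = partition(list(range(1, 11)), num_shards=3)
    print(distributed_mean(shards))
```

Note what crosses the "network" here: one pair of numbers per shard, regardless of how many records each shard holds. That asymmetry is why sending the work to the Data is so much cheaper than the reverse.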