Advantages of Data Warehouse
Reporting Mechanisms
Reporting Tools
eBay is simply an amazing company. Just the simple fact that it is able to track the number of daily transactions is a testament to its architecture. The volume of traffic and the security required on a day-to-day basis would bring most companies to their knees.
eBay works on the user experience from the moment users hit the site until the moment they make a purchase, from code to data-center automation to building new picture-hosting platforms. Analyzing the data behind that experience drives traffic to eBay and improves the customer experience.
It does this by operating a two-pronged big data attack consisting of a massive Teradata data warehouse and a fast-growing Hadoop environment. The Teradata warehouse serves financial analysts, who like SQL and want more of a WYSIWYG experience.
Hadoop stores and processes unstructured data such as server logs, click-throughs, and search queries, and eBay makes "enormous use" of it.
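As a rough illustration of what processing that kind of unstructured log data can look like, here is a minimal Python sketch that parses web-server log lines into structured fields and tallies hits on a hypothetical /search endpoint. The log format, field names, and endpoint are assumptions made for illustration, not eBay's actual schema.

import re
from collections import Counter

# Minimal sketch: turn raw combined-log-format lines into structured records.
# The format and the /search?q= endpoint are illustrative assumptions.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def top_search_queries(lines, n=10):
    """Count requests to the assumed /search endpoint and return the n most common query strings."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["path"].startswith("/search?q="):
            counts[record["path"].split("q=", 1)[1]] += 1
    return counts.most_common(n)

At eBay's scale this kind of logic would run inside Hadoop rather than on a single machine, which is what the MapReduce sketch further below illustrates.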
In late 2010, eBay predicted its Teradata deployment would grow from about 10 petabytes to 20 petabytes (or 20,000 terabytes, equivalent to about 266 years' worth of HD video) within a year. Its Hadoop environment is currently storing between 9 and 10 petabytes, but is always growing. In fact, the Hadoop environment doubled in size in the past year, in part from more user data streaming in and in part from analysts running lots of Hadoop jobs and creating new, larger data sets that also remain in the system.
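A quick back-of-the-envelope check of the "years of HD video" comparison (the 20 Mbit/s HD bitrate below is my own assumption, not a figure from eBay):

# 20 PB expressed as a duration of HD video at an assumed 20 Mbit/s bitrate.
PETABYTE = 10**15                      # decimal petabyte, in bytes
warehouse_bytes = 20 * PETABYTE
hd_bytes_per_sec = 20 * 10**6 / 8      # assumed 20 Mbit/s HD stream
seconds_per_year = 365.25 * 24 * 3600
print(warehouse_bytes / (hd_bytes_per_sec * seconds_per_year))  # ~253 years, same ballpark as 266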
This happens both at a broad scale (say, improving the accuracy of its search engine) and also more narrowly around building specific features the data suggests customers would want. For example, Hadoop has proven helpful in deciphering patterns of misspelled words, so now eBay's search engine knows to look instead for an actual word or product when users type certain queries incorrectly. In the middle, between broad improvements and narrow data-driven features, Hadoop helps eBay learn a lot about how it is different and how it can differentiate itself further.
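One generic way misspelling patterns can be turned into query corrections is to generate close variants of a query term and keep any that match a dictionary of known terms, as in the rough Python sketch below. This is a textbook edit-distance technique, not a description of eBay's actual search pipeline, and the example dictionary is made up.

# Sketch of single-edit spelling correction against a dictionary of known terms.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_edit_variants(word):
    """All strings one insert, delete, replace, or transpose away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in LETTERS]
    inserts = [left + ch + right for left, right in splits for ch in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(term, known_terms):
    """Return the term if it is known, otherwise any known single-edit variant."""
    if term in known_terms:
        return term
    candidates = one_edit_variants(term) & known_terms
    return min(candidates) if candidates else term

print(correct("ipohne", {"iphone", "ipad", "laptop"}))  # -> "iphone"

In a real system the dictionary and the choice among candidates would presumably be driven by the query and click logs Hadoop already stores, weighted by how often each candidate is actually searched.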
Hadoop's sweet spot is as a batch-processing engine, using its native MapReduce framework to process large data sets, but eBay is pushing it beyond that role as well.
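To make the batch-processing pattern concrete, here is a minimal Hadoop Streaming job in Python that counts search queries, assuming one query per input line. The file name, input paths, and invocation are illustrative assumptions, not an actual eBay job.

#!/usr/bin/env python3
# mr_query_count.py: word-count-style Hadoop Streaming job over query logs.
# Sketch only; run roughly as
#   hadoop jar hadoop-streaming.jar -files mr_query_count.py \
#       -input /logs/queries -output /out/query_counts \
#       -mapper "python3 mr_query_count.py map" \
#       -reducer "python3 mr_query_count.py reduce"
import sys

def mapper():
    # Emit (query, 1) for every non-empty input line.
    for line in sys.stdin:
        query = line.strip().lower()
        if query:
            print(f"{query}\t1")

def reducer():
    # Streaming sorts by key, so all counts for a query arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[-1] == "map" else reducer()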
Best Practices:
Running one of the largest, most heavily loaded websites in the world can't be easy. The subtitle of the presentation hints at the true engineering required to create such a monster system: striking a balance between site stability, feature velocity, performance, and cost.
You may not be able to emulate how eBay scales its system, but the issues and possible solutions are worth learning from.
Metrics on eBay’s main Teradata data warehouse include:
>2 petabytes of user data
10s of 1000s of users
Millions of queries per day
72 nodes
>140 GB/sec of I/O, or 2 GB/node/sec, though that may be a peak figure when the workload is scan-heavy (see the quick check after this list)
100s of production databases being fed in
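As a quick check that the aggregate and per-node I/O figures are consistent (simple division of the two numbers quoted above, nothing more):

# 140 GB/sec of aggregate I/O spread across 72 nodes.
print(140 / 72)  # ~1.94 GB/node/sec, matching the quoted "2 GB/node/sec"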
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
6 1/2 petabytes of user data
17 trillion records
150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day (see the rough check after this list)
96 nodes
200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
4.5 petabytes of storage
70% compression
A small number of concurrent users
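A rough look at what that 50 terabytes/day inference assumes about record size (the arithmetic is mine, not from the source):

# At 150 billion new records/day, exceeding 50 TB/day only requires an
# average record size above roughly 333 bytes.
records_per_day = 150e9
print(50e12 / records_per_day)  # ~333 bytes/record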