Advantages of Data Warehouse
Reporting Mechanisms
Reporting Tools
eBay is simply an amazing company. Just the simple fact that it is able to track the number of daily transactions is a testament to its architecture. The volume of traffic and the security required on a day-to-day basis would bring most companies to their knees.
eBay works on the user experience from the moment users hit the site until the moment they make a purchase, from code to data-center automation to building new picture-hosting platforms. Analyzing the data behind that experience drives traffic to eBay and improves the customer experience.
It does this by operating a two-pronged big data attack consisting of a massive Teradata data warehouse and a fast-growing Hadoop environment. The Teradata warehouse serves financial analysts, who like SQL and want more of a WYSIWYG experience.
Hadoop stores and processes unstructured data such as server logs, click-throughs, and search queries, and eBay makes "enormous use" of it.
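As a rough illustration of what processing that kind of unstructured log data can look like, here is a minimal Python sketch that parses web-server log lines into structured fields and tallies hits on a hypothetical /search endpoint. The log format, field names, and endpoint are assumptions made for illustration, not eBay's actual schema.

import re
from collections import Counter

# Minimal sketch: turn raw combined-log-format lines into structured records.
# The format and the /search?q= endpoint are illustrative assumptions.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def top_search_queries(lines, n=10):
    """Count requests to the assumed /search endpoint and return the n most common query strings."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["path"].startswith("/search?q="):
            counts[record["path"].split("q=", 1)[1]] += 1
    return counts.most_common(n)

At eBay's scale this kind of logic would run inside Hadoop rather than on a single machine, which is what the MapReduce sketch further below illustrates.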
In late 2010, eBay predicted its Teradata deployment would grow from about 10 petabytes to 20 petabytes (or 20,000 terabytes, equivalent to about 266 years' worth of HD video) within a year. Its Hadoop environment is currently storing between 9 and 10 petabytes, but is always growing. In fact, the Hadoop environment doubled in size in the past year, in part from more user data streaming in and in part from analysts running lots of Hadoop jobs and creating new, larger data sets that also remain in the system.
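A quick back-of-the-envelope check of the "years of HD video" comparison (the 20 Mbit/s HD bitrate below is my own assumption, not a figure from eBay):

# 20 PB expressed as a duration of HD video at an assumed 20 Mbit/s bitrate.
PETABYTE = 10**15                      # decimal petabyte, in bytes
warehouse_bytes = 20 * PETABYTE
hd_bytes_per_sec = 20 * 10**6 / 8      # assumed 20 Mbit/s HD stream
seconds_per_year = 365.25 * 24 * 3600
print(warehouse_bytes / (hd_bytes_per_sec * seconds_per_year))  # ~253 years, same ballpark as 266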
This happens both at a broad scale (say, improving the accuracy of its search engine) and also more narrowly around building specific features the data suggests customers would want. For example, Hadoop has proven helpful in deciphering patterns of misspelled words, so now eBay's search engine knows to look instead for an actual word or product when users type certain queries incorrectly. In the middle, between broad improvements and narrow data-driven features, Hadoop helps eBay learn a lot about how it is different and how it can differentiate itself further.
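One generic way misspelling patterns can be turned into query corrections is to generate close variants of a query term and keep any that match a dictionary of known terms, as in the rough Python sketch below. This is a textbook edit-distance technique, not a description of eBay's actual search pipeline, and the example dictionary is made up.

# Sketch of single-edit spelling correction against a dictionary of known terms.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_edit_variants(word):
    """All strings one insert, delete, replace, or transpose away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in LETTERS]
    inserts = [left + ch + right for left, right in splits for ch in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(term, known_terms):
    """Return the term if it is known, otherwise any known single-edit variant."""
    if term in known_terms:
        return term
    candidates = one_edit_variants(term) & known_terms
    return min(candidates) if candidates else term

print(correct("ipohne", {"iphone", "ipad", "laptop"}))  # -> "iphone"

In a real system the dictionary and the choice among candidates would presumably be driven by the query and click logs Hadoop already stores, weighted by how often each candidate is actually searched.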
Hadoop's sweet spot is as a batch-processing engine, using its native MapReduce framework to process large data sets, but eBay is pushing it beyond that role as well.
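To make the batch-processing pattern concrete, here is a minimal Hadoop Streaming job in Python that counts search queries, assuming one query per input line. The file name, input paths, and invocation are illustrative assumptions, not an actual eBay job.

#!/usr/bin/env python3
# mr_query_count.py: word-count-style Hadoop Streaming job over query logs.
# Sketch only; run roughly as
#   hadoop jar hadoop-streaming.jar -files mr_query_count.py \
#       -input /logs/queries -output /out/query_counts \
#       -mapper "python3 mr_query_count.py map" \
#       -reducer "python3 mr_query_count.py reduce"
import sys

def mapper():
    # Emit (query, 1) for every non-empty input line.
    for line in sys.stdin:
        query = line.strip().lower()
        if query:
            print(f"{query}\t1")

def reducer():
    # Streaming sorts by key, so all counts for a query arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[-1] == "map" else reducer()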
Best Practices:
Running one of the largest, most heavily loaded websites in the world can't be easy. The subtitle of the presentation hints at the true engineering required to create such a monster system: striking a balance between site stability, feature velocity, performance, and cost.
You may not be able to emulate how eBay scales its system, but the issues and possible solutions are worth learning from.
Metrics on eBay’s main Teradata data warehouse include:
>2 petabytes of user data
10s of 1000s of users
Millions of queries per day
72 nodes
>140 GB/sec of I/O, or 2 GB/node/sec, though that may be a peak figure when the workload is scan-heavy (see the quick check after this list)
100s of production databases being fed in
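As a quick check that the aggregate and per-node I/O figures are consistent (simple division of the two numbers quoted above, nothing more):

# 140 GB/sec of aggregate I/O spread across 72 nodes.
print(140 / 72)  # ~1.94 GB/node/sec, matching the quoted "2 GB/node/sec"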
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
6 1/2 petabytes of user data
17 trillion records
150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day (see the rough check after this list)
96 nodes
200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
4.5 petabytes of storage
70% compression
A small number of concurrent users
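A rough look at what that 50 terabytes/day inference assumes about record size (the arithmetic is mine, not from the source):

# At 150 billion new records/day, exceeding 50 TB/day only requires an
# average record size above roughly 333 bytes.
records_per_day = 150e9
print(50e12 / records_per_day)  # ~333 bytes/record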