Design and Implementation of a Data Stream Management System
with Complex Event Processing Capabilities
ABSTRACT
2010 National Grants
Computer Science
The world has seen a proliferation of data stream applications over the last decade. These applications include network or traffic
monitoring, online trading or transaction monitoring, supply chain management with Radio Frequency Identification (RFID), health
monitoring, data center automation, web click-streams, other military and civilian applications using sensor networks, and many
more. The organizations that run these applications consider them mission-critical and require real-time processing, so that
strategic decisions can be made quickly. The analysis needs of these emerging applications call for design and implementation
specifications that differ inherently from those of existing Database Management Systems (DBMS), which are sometimes used
in an ad-hoc fashion to address the processing needs of the listed applications. An emerging system architecture called the Data Stream
Management System (DSMS) is better suited for this purpose. The main differences between DSMS and DBMS are outlined below.
First, a DSMS runs queries over unbounded, fast-moving, and dynamic data streams, usually while the data is in memory and before it
is ever stored (persistence is optional in a DSMS). In a DBMS, ad-hoc queries run over stationary data that has already been saved
to the database, an assumption that breaks down for data streams, which are characterized as unbounded and possibly
bursty. In a DSMS, queries are first “registered” with the system and become Continuous Queries (CQ). CQ are relatively static while the
data they process is dynamic. As a result, query plan optimization remains an open research field for DSMS, because it is extremely
challenging to build, optimize, and adapt query plans for unbounded and unpredictable datasets. The unpredictability comes from stream
characteristics such as unknown and varying arrival rates, missing tuples, out-of-order arrivals, and ad-hoc dependence on external
data. Second, a Continuous Query Language (CQL) needs richer temporal and spatial clauses than its DBMS counterpart, the
Structured Query Language (SQL), because of the time-window constraints on data streams. Finally, real-time Complex Event Processing
(CEP) requires complex joins for correlation, mixing in-flight data with data stored elsewhere (e.g., databases, data
warehouses) and possibly r [