Catallaxy Services | @feaselkl
Curated SQL
We Speak Linux
As data size expands, numerous products have entered the data storage market to solve particular pain points. This talk will cover, at a high level, many of the data storage technologies currently available on the market.
The expansion of data sets and the increased expectations of businesses for analysis and modeling of data have led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use before somebody else puts them to sub-optimal use.
When you have too much data to fit into Excel.
Big Data is built around four major dimensions:
Data sets too large to fit on a single machine but not large enough to require a massive cluster.
SparkR (R but able to use a Spark cluster's memory) is a good example of a product which thrives in the Medium Data space.
Stress Points:
For each technology, we will:
Relational databases are built on set theory, a branch of mathematics dedicated to dealing with collections of things.
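To make the set-theory connection concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names and rows are hypothetical, invented only for illustration; the point is that SQL's `UNION`, `INTERSECT`, and `EXCEPT` map directly onto set union, intersection, and difference.

```python
import sqlite3

# Hypothetical tables for illustration only; they are not from the talk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE east_customers (name TEXT);
    CREATE TABLE west_customers (name TEXT);
    INSERT INTO east_customers VALUES ('Ann'), ('Bob'), ('Cy');
    INSERT INTO west_customers VALUES ('Bob'), ('Cy'), ('Dee');
""")

def names(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Set union: customers in either region, duplicates removed.
union = names(
    "SELECT name FROM east_customers UNION SELECT name FROM west_customers"
)
# Set intersection: customers in both regions.
intersection = names(
    "SELECT name FROM east_customers INTERSECT SELECT name FROM west_customers"
)
# Set difference: customers only in the east.
difference = names(
    "SELECT name FROM east_customers EXCEPT SELECT name FROM west_customers"
)
```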
Multidimensional databases are used for reporting and business analysis and are made up of several parts.
Hadoop is a massive, distributed, batch processing system.
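Hadoop's classic batch model is MapReduce. The following is a single-process sketch of the map, shuffle, and reduce phases for a word count; it is an illustration of the idea only, not Hadoop's API, and the function names and input lines are assumptions made for this example. On a real cluster, each phase would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

# Hypothetical input; a real job would read file blocks from distributed storage.
lines = ["big data big cluster", "data lake"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```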
A columnstore table stores sections of columnar data rather than rows of data.
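As a rough sketch of the difference in layout, the example below pivots a few hypothetical rows into per-column lists and then splits one column into fixed-size segments. The segment size and the data are assumptions for illustration; real columnstore engines use far larger segments and compress them.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "East", "sales": 100},
    {"id": 2, "region": "West", "sales": 250},
    {"id": 3, "region": "East", "sales": 175},
]

# Column-oriented layout: each column's values are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def to_segments(values, size):
    """Split one column into fixed-size sections (the size is an assumption)."""
    return [values[i:i + size] for i in range(0, len(values), size)]

sales_segments = to_segments(columns["sales"], 2)

# An aggregate over one column reads only that column's segments,
# never the id or region values.
total_sales = sum(sum(segment) for segment in sales_segments)
```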
Key-value pairs stored in memory on a RAM-heavy cache server. CPU count is not necessarily important for this server, but RAM is.
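One common way to use such a cache is the cache-aside pattern: check the cache first and go to the slower system only on a miss. The sketch below uses a plain dictionary as a stand-in for the cache server; `slow_lookup`, the key format, and the user record are hypothetical.

```python
cache = {}          # stand-in for the in-memory cache server
backend_calls = 0   # counts how often we hit the slow system

def slow_lookup(user_id):
    """Hypothetical expensive call, e.g. a query against the system of record."""
    global backend_calls
    backend_calls += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside read: serve from memory when possible, load on a miss."""
    key = f"user:{user_id}"
    if key not in cache:
        cache[key] = slow_lookup(user_id)
    return cache[key]

first = get_user(42)   # miss: calls the backend and fills the cache
second = get_user(42)  # hit: served from memory
```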
Data stored as key-value pairs. This data may be modelled with more complex data types.
Document databases are a subset of key-value databases in which the value is an element with an internal structure (e.g., JSON or XML). They are designed for nesting and for holding an entire object's structure in one record.
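A small sketch of that idea: one key maps to one JSON document that carries the whole object, including its nested line items. The order structure and key name are hypothetical, and a dictionary stands in for the database.

```python
import json

# Hypothetical order: the customer and line items nest inside one record
# instead of being split across several tables.
order_document = {
    "order_id": 1001,
    "customer": {"name": "Ann", "city": "Durham"},
    "items": [
        {"sku": "A-1", "qty": 2, "price_cents": 500},
        {"sku": "B-7", "qty": 1, "price_cents": 1250},
    ],
}

# Stand-in for the document store: key -> serialized document.
store = {"order:1001": json.dumps(order_document)}

# One lookup returns the entire object; no joins are needed to rebuild it.
loaded = json.loads(store["order:1001"])
order_total_cents = sum(
    item["qty"] * item["price_cents"] for item in loaded["items"]
)
```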
Graph databases store associative data as edges and nodes, where edges are first-class citizens.
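To illustrate what "first-class" edges might look like, the sketch below models each edge as its own record with a label and a property, then walks the graph by following edges. The `Edge` type, the labels, and the data are assumptions for illustration and do not reflect any particular graph database's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """An edge is a record in its own right, with a label and properties."""
    source: str
    target: str
    label: str
    since: int

# Hypothetical social graph.
edges = [
    Edge("alice", "bob", "FOLLOWS", 2019),
    Edge("bob", "carol", "FOLLOWS", 2021),
    Edge("alice", "carol", "WORKS_WITH", 2020),
]

def neighbors(node, label):
    """Nodes reachable from `node` over one edge with the given label."""
    return [e.target for e in edges if e.source == node and e.label == label]

def reachable(start, label):
    """All nodes reachable from `start` by following edges with the label."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        for target in neighbors(current, label):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

followed = reachable("alice", "FOLLOWS")
```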
Distributed database used for text searches.
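Text search engines commonly rely on an inverted index, which maps each term to the documents containing it. The sketch below is a single-machine illustration under that assumption; the documents are hypothetical, and a real engine would add tokenization rules, ranking, and distribution across nodes.

```python
from collections import defaultdict

# Hypothetical documents keyed by id.
documents = {
    1: "big data storage",
    2: "data lake storage in azure",
    3: "graph data model",
}

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    """Return ids of documents that contain every term in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```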
A publisher/subscriber system with storage of messages in a queue. Messages may get removed after processing (e.g., MSMQ, Service Broker), or they may drop off the queue after a certain amount of time (e.g., Kafka, Kinesis).
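The two retention behaviors can be sketched side by side. Both classes below are hypothetical, in-memory illustrations and not the APIs of MSMQ, Service Broker, Kafka, or Kinesis; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import deque

class ConsumeOnceQueue:
    """Messages are removed as soon as a consumer receives them."""
    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None

class RetentionLog:
    """Messages stay readable until they are older than the retention period."""
    def __init__(self, retention_seconds):
        self._retention = retention_seconds
        self._entries = deque()  # (publish_time, message), oldest first

    def publish(self, message, now):
        self._entries.append((now, message))

    def read_all(self, now):
        # Drop expired entries; reading itself never removes a message.
        while self._entries and now - self._entries[0][0] >= self._retention:
            self._entries.popleft()
        return [message for _, message in self._entries]
```

With the first class, a second consumer never sees a message that has already been received; with the second, any number of consumers can re-read a message until it ages out.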
A distributed real-time computation system. Integrates with a message queue and performs some set of actions with the data, like calculating aggregations or cleansing data.
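As an illustration of the kind of work such a system performs, the sketch below cleanses a small stream of events and aggregates it into fixed (tumbling) time windows. The function name, event shape, and window size are assumptions for this example; a real stream processor would handle events continuously as they arrive from the queue rather than from a list.

```python
from collections import defaultdict

def tumbling_window_sums(events, window_seconds):
    """Sum event values per fixed-size time window, skipping bad records."""
    sums = defaultdict(int)
    for timestamp, value in events:
        if value is None:
            continue  # cleansing step: drop records with a missing value
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)

# Hypothetical (timestamp_seconds, value) events pulled from a message queue.
events = [(0, 5), (3, None), (7, 10), (12, 1), (19, 4)]
window_totals = tumbling_window_sums(events, 10)
```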
Most of the tools discussed here have Microsoft Azure Platform-as-a-Service or Software-as-a-Service versions available. We discussed some of them in earlier slides, but this section will cover Azure versions of all of the tools.
Microsoft is pushing the concept of a data lake: a collection of different data stores in different formats accessible through a common language (U-SQL).
We all know about web applications, thick clients, and tools like Excel. Here are a couple of interesting subsets of tools which help us understand and visualize the data we're storing.
There is a plethora of data storage methods available to you. Choose the one(s) best-suited for your organization and data needs.
To learn more, go here: http://CSmore.info/on/bigdata
And for help, contact me: feasel@catallaxyservices.com | @feaselkl