The website DB-Engines keeps track of over 350 different data platform technologies, ranging from relational databases to data warehouses, document databases, key-value stores, search engines, time series, graph databases, and more.
My goals in this talk:
This talk surveys data platform technologies broadly and does not spend much time comparing the merits of individual products.
Oftentimes, "the platform you have" is a perfectly reasonable answer to "Which platform should I choose?" Understanding how (and when!) to use these platforms is my goal for today.
We work for Catallaxy Widgets, a major retailer of fine widgets and widget accessories. Our holdings include hundreds of stores around the world, as well as a major website.
Our IT team is looking to modernize several key systems in the organization and asked for our guidance.
Our current system has worked, but we're experiencing some pain points:
For each of these problem domains, we will look at data platform technologies well-suited for the domain.
Not all of these technologies are necessary and we can certainly make substitutions, but these are solid choices for the job.
Relational databases can serve as either OLTP (transactional) or OLAP (analytical) systems; these are database designs rather than distinct technologies.
There are also technologies dedicated to extending beyond relational OLAP, such as SQL Server Analysis Services and Oracle Essbase.
Product has Images, PriceChanges, and StoreAvailability, as well as attributes like Price, Title, and Brand.
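As a rough sketch of what one such document might look like, here is a single Product built as a Python dictionary and serialized to JSON. The ProductID field, the nested field names (Url, EffectiveDate, StoreID, QuantityOnHand), and all of the values are made up for illustration; only the top-level names come from the slide.

```python
import json

# Hypothetical product document: every value below is illustrative only.
product = {
    "ProductID": "WIDGET-001",
    "Title": "Deluxe Widget",
    "Brand": "Catallaxy",
    "Price": 19.99,
    # Child collections are nested inside the document rather than
    # living in separate tables joined by key.
    "Images": [
        {"Url": "images/widget-001-front.png", "IsPrimary": True},
    ],
    "PriceChanges": [
        {"EffectiveDate": "2024-01-01", "Price": 17.99},
        {"EffectiveDate": "2024-06-01", "Price": 19.99},
    ],
    "StoreAvailability": [
        {"StoreID": 12, "QuantityOnHand": 40},
        {"StoreID": 57, "QuantityOnHand": 0},
    ],
}


def stores_in_stock(doc):
    """Return the IDs of stores with at least one unit on hand."""
    return [s["StoreID"] for s in doc["StoreAvailability"] if s["QuantityOnHand"] > 0]


# This JSON text is roughly what a document database would store.
document_json = json.dumps(product, indent=2)
```

Everything the catalog page needs arrives in one read, which is the main appeal of the document model here.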
Short answer: no.
Long answer: an OLTP database may be a good choice for a busy product catalog, as it gives you a correct source system and it allows you to "true up" the document database(s).
I'd recommend using an OLTP system for the shopping cart unless your company is enormous like Amazon.
If you do get to that point, a key-value store or document database can work well for the shopping cart, but be sure to have post-order mechanisms to ensure that products are available, prices were correct, the method of billing was successful, etc. Use message brokers to split apart these systems.
Hadoop is a massive, distributed, batch processing system. Hadoop itself has three key components: the Hadoop Distributed File System (HDFS), the MapReduce library, and the resource allocation engine Yet Another Resource Negotiator (YARN).
The MapReduce library has fallen out of vogue along with pure Hadoop clusters, but the Hadoop ecosystem is thriving, especially Apache Spark.
Spark provides in-memory cluster computing, avoiding MapReduce's reliance on heavy I/O use.
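To make the MapReduce pattern itself concrete, here is a single-process sketch in plain Python. This is not the Hadoop or Spark API; the function names are my own, and a real cluster would spread each phase across many machines. In classic MapReduce, intermediate results between phases and between chained jobs get written to disk, which is the I/O cost Spark tries to avoid by keeping data in memory.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into one result."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["fine widgets", "widgets and widget accessories"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
```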
Spark ties into several major cloud technologies, including Databricks, HDInsight / Elastic MapReduce, and Azure Data Factory / AWS Glue.
HDFS opened up the possibility of massive, distributed storage of data, including multi-structured and unstructured data, which typically would not fit well in a classic data warehouse.
The data lake provides a central location for historical storage of a broad array of company data for the purpose of data science and machine learning activities.
Databricks has coined the term Lakehouse to represent the combination of data warehouse and data lake in one managed area.
Graph databases have a niche in the analytics space. Graph databases combine nodes (which represent entities) and edges (which represent connections between entities).
The biggest problem with graph databases is that you can do the same things with relational databases, but with only one concept (the relation) versus two (nodes and edges).
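Here is a small sketch of that point: a graph stored as two ordinary relations, with a recursive query standing in for a graph traversal. It uses SQLite through Python's built-in sqlite3 module purely for convenience; the Person and Follows tables and their contents are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Nodes are rows in one relation...
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- ...and edges are rows in another.
CREATE TABLE Follows (FollowerID INTEGER NOT NULL, FolloweeID INTEGER NOT NULL);
INSERT INTO Person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cho'), (4, 'Dee');
INSERT INTO Follows VALUES (1, 2), (2, 3), (3, 4);
""")

# Recursive common table expression: everyone reachable from one person
# by following edges any number of hops.
rows = conn.execute("""
WITH RECURSIVE Reach(PersonID) AS (
    SELECT FolloweeID FROM Follows WHERE FollowerID = ?
    UNION
    SELECT f.FolloweeID
    FROM Follows f
    JOIN Reach r ON f.FollowerID = r.PersonID
)
SELECT p.Name
FROM Reach r
JOIN Person p ON p.PersonID = r.PersonID
ORDER BY p.Name;
""", (1,)).fetchall()

reachable_names = [row[0] for row in rows]
```

Using UNION rather than UNION ALL drops duplicates, which also keeps the recursion from looping forever if the graph contains a cycle.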
The second-biggest problem with graph databases is that there is no common graph language like SQL or common implementation specs between products.
Message brokers receive messages from producers and send messages to consumers. They provide a logical disconnect between the two.
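A toy illustration of that disconnect, using an in-process queue from Python's standard library as a stand-in for a real broker. The OrderID messages and the None sentinel are assumptions made for this sketch; a real broker adds durability, topics, and delivery guarantees that a simple queue does not have.

```python
import queue
import threading

broker = queue.Queue()  # stands in for a single broker topic
received = []


def producer(order_ids):
    """Publish messages without knowing who will consume them."""
    for order_id in order_ids:
        broker.put({"OrderID": order_id})
    broker.put(None)  # sentinel: tells the consumer no more messages are coming


def consumer():
    """Pull messages at the consumer's own pace until the sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        received.append(message["OrderID"])


consumer_thread = threading.Thread(target=consumer)
consumer_thread.start()
producer([101, 102, 103])
consumer_thread.join()
```

Neither function calls the other: the producer only knows about the queue, and so does the consumer.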
Stream processing handles messages one at a time (e.g., Kafka Streams, Flink) or in microbatches (Spark Streaming).
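The difference between the two styles can be sketched in a few lines of plain Python. This does not use Kafka Streams, Flink, or Spark; the function names and sample readings are hypothetical and only show the shape of each approach.

```python
from itertools import islice


def one_at_a_time(messages, handler):
    """Invoke the handler once per message, as each one arrives."""
    return [handler(message) for message in messages]


def microbatches(messages, batch_size):
    """Collect messages into small fixed-size batches before processing."""
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


readings = [3, 5, 8, 13, 21]

# Per-message processing: one result per input message.
per_message = one_at_a_time(readings, lambda r: r * 2)

# Microbatch processing: one result per batch of up to two messages.
per_batch = [sum(batch) for batch in microbatches(readings, batch_size=2)]
```

Real microbatch engines typically cut batches by time window rather than by message count; count is used here only to keep the example deterministic.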
There are full-service logging solutions, such as Splunk, Datadog, Loggly, and Sumo Logic. These products perform quite well and tend to be accessible for developers and administrators. The downside is that they tend to be quite expensive.
On the other side, open source products exist as well and can be quite powerful when used correctly, but the learning curve tends to be much higher.
As soon as you have two data platform systems, you introduce the need to combine data.
There are three major approaches to data movement: ETL, ELT, and Data Virtualization.
For a long time, the normal pattern for data movement was Extract-Transform-Load (ETL). With the massive increase in data sizes, we have seen a move toward Extract-Load-Transform (ELT).
ETL modifies data during the movement process: Extract data from a system, Transform it in the mover, and then Load the resulting data into your destination.
By contrast, with ELT, we Extract data from a system, Load it into a staging area on the destination, and Transform the data into its final form using the destination's compute resources.
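A side-by-side sketch of the two patterns, using an in-memory SQLite database as the destination. The table names, the sample rows, and the cleanup rules (trim, upper-case, cast to a number) are all invented for illustration; the point is only where the transformation runs.

```python
import sqlite3

# Pretend these rows were extracted from a source system.
source_rows = [("  widget ", "19.99"), ("GADGET", "5.00")]

dest = sqlite3.connect(":memory:")
dest.executescript("""
CREATE TABLE ProductEtl (Name TEXT, Price REAL);
CREATE TABLE StagingProduct (RawName TEXT, RawPrice TEXT);
CREATE TABLE ProductElt (Name TEXT, Price REAL);
""")

# ETL: transform in the mover (this Python process), then load clean rows.
transformed = [(name.strip().upper(), float(price)) for name, price in source_rows]
dest.executemany("INSERT INTO ProductEtl VALUES (?, ?)", transformed)

# ELT: load raw rows into staging, then transform with the destination's own engine.
dest.executemany("INSERT INTO StagingProduct VALUES (?, ?)", source_rows)
dest.execute("""
INSERT INTO ProductElt (Name, Price)
SELECT UPPER(TRIM(RawName)), CAST(RawPrice AS REAL)
FROM StagingProduct
""")

etl_rows = dest.execute("SELECT Name, Price FROM ProductEtl ORDER BY Name").fetchall()
elt_rows = dest.execute("SELECT Name, Price FROM ProductElt ORDER BY Name").fetchall()
```

Both paths end with the same rows. With ELT, the heavy lifting happens in SQL on the destination, which is why it scales with the destination's compute rather than the mover's.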
In addition to moving data from system to system, we can virtualize data, making it appear to move while remaining in its current location. Virtualization tools are commonly third-party products which sit on top of several data platform technologies and offer a "single pane of glass" view of databases. Functionality typically includes the ability to join between sources.
The downside to virtualization is that performance typically suffers with larger sets of data.
SQL Server 2019 extends a Microsoft technology called PolyBase, which allows you to virtualize data from a number of different data platform technologies, including Hadoop, Azure Blob Storage, SQL Server, Oracle, MongoDB, Cosmos DB, Spark, DB2, Excel, and more.
For more, go to https://csmore.info/on/polybase.
One difference between PolyBase and other data virtualization products is that PolyBase enables ELT into SQL Server. You can create an external table from a remote data source and use that to land data into SQL Server.
This has been a look at the data platform space as it stands. This is a fast-changing field with interesting competitors entering and leaving the market regularly.
To learn more, go here:
https://CSmore.info/on/cdp
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact