Overview of data storage requirements in IoT scenarios:
- IoT analysis is all about getting insights from raw telemetry data.
- One of the reasons to store telemetry data is to enable the training of machine learning models. Typically, these models require a vast amount of data to train on before they yield meaningful insights.
- It’s common to use the familiar terms hot, warm, cool, and cold in data analysis.
- Hot means the data must be processed in real time.
- Warm is similar, though the data may be near-real-time, or merely recent.
- Cool means the flow of data is slow.
- Cold means that the data is stored and not flowing. The cooler the path, the more the data can be batched.
- In short, in IoT scenarios, some data streams in continuously and needs to be acted upon immediately, whereas other data is uploaded in batches.
- One of the easiest ways of handling this data duality from the device is to send data at different frequencies.
- The first kind of message contains only the telemetry data that needs to be analyzed in real time.
- The second type of message, sent at a lower frequency, contains a batch of telemetry data plus any other metadata that might be needed for deeper analysis or archiving.
- The IoT Hub routes these two message types to different resources, as in the sketch below.
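A minimal device-side sketch of this dual-frequency pattern, assuming the azure-iot-device Python package; the connection string, device model name, and the `path` application property are all hypothetical (an IoT Hub routing query would filter on whatever property you choose):

```python
import json
import random
import time

from azure.iot.device import IoTHubDeviceClient, Message

# Hypothetical device connection string; substitute your own.
CONN_STR = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
client.connect()

batch = []
for _ in range(60):
    reading = {"temperature": 20 + random.random() * 5, "ts": time.time()}

    # Hot message: one small reading per second, tagged for real-time routing.
    hot = Message(json.dumps(reading))
    hot.custom_properties["path"] = "hot"  # a routing query can match on this
    client.send_message(hot)

    batch.append(reading)
    time.sleep(1)

# Cold message: the whole batch plus extra metadata, sent once a minute.
cold = Message(json.dumps({"deviceModel": "thermo-1", "readings": batch}))
cold.custom_properties["path"] = "cold"
client.send_message(cold)

client.shutdown()
```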
- Hot Data Path:
- The IoT device sends specific telemetry data in its own message, which is routed by the IoT Hub for instant analysis and visualization, say, using Azure Time Series Insights.
- Alternatively, the analysis could be handled by Azure Stream Analytics, which supports a simple SQL query language and is extensible via C# or JavaScript user-defined functions (UDFs); a windowed aggregation of the kind sketched after this list is typical.
- The hot path requires storage optimized for data availability.
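Stream Analytics itself would express the hot-path computation as a SQL query (for example, an average over `TumblingWindow(second, 30)`); the plain-Python sketch below only illustrates what such a windowed query computes:

```python
from collections import defaultdict

def tumbling_average(events, window_seconds=30):
    """Average (timestamp, value) events over fixed, non-overlapping windows.

    Mimics a Stream Analytics query along the lines of:
      SELECT AVG(temperature) FROM input GROUP BY TumblingWindow(second, 30)
    """
    windows = defaultdict(list)
    for ts, value in events:
        windows[int(ts // window_seconds)].append(value)
    return {w * window_seconds: sum(v) / len(v) for w, v in sorted(windows.items())}

# Three readings spanning two 30-second windows.
print(tumbling_average([(0, 21.0), (10, 23.0), (35, 25.0)]))  # {0: 22.0, 30: 25.0}
```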
- Cold Data Path:
- The IoT device sends out batched telemetry and logging data.
- The IoT Hub directs these messages down a route to an Azure storage account.
- Cold path storage is optimized for size (that is, it’s compressed), long-term storage, and low cost.
- The cold path is not optimized for availability.
- If data consists of files, images, recordings, and similar disparate items, then it’s considered unstructured data.
- If data neatly divides into similar database-like objects, then it’s considered structured data.
Azure Cloud Storage options for IoT scenarios:
Blob Storage (Unstructured Data):
- Unstructured Data: Blob Storage (the default storage in an Azure Storage account):
- It’s referred to as unstructured storage, which means each entry in the storage doesn’t conform to any particular model.
- For example, one entry might be video, another an audio recording, a third a group of text files, and so on.
- Blob storage is similar to the file-and-folder structure you’re used to on your laptop or desktop computer.
- Blob storage, by default, has a general-purpose setting applied. Whatever data you route to the account is stored with reasonable access settings.
- Blob storage can be accessed via API calls. The APIs are available via REST calls, Azure PowerShell, or the Azure CLI. Client libraries are available for .NET, Java, Python, Node.js, and other languages.
- Blob storage is easily accessible, secure, and low-cost.
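As a sketch of the client-library route, here’s a telemetry-batch upload with the azure-storage-blob Python package; the connection string, container, and blob names are hypothetical, and the container is assumed to already exist:

```python
import json

from azure.storage.blob import BlobServiceClient

# Hypothetical storage-account connection string; substitute your own.
CONN_STR = "DefaultEndpointsProtocol=https;AccountName=<acct>;AccountKey=<key>;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(CONN_STR)
container = service.get_container_client("telemetry")  # assumed to exist

batch = [{"temperature": 22.5, "ts": 1700000000}]

# Blob names may contain '/' to mimic the familiar folder hierarchy.
container.upload_blob(
    name="device-01/2023/11/batch-0001.json",
    data=json.dumps(batch),
    overwrite=True,
)
```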
- 3 Roles for Blobs:
- Block blob:
- When you have a large volume of data, it can be more efficient to access that data if it’s divided into blocks.
- Each block has a unique ID, and you have access to this ID.
- You can use a block’s ID to read from, and write to, that specific block.
- Blocks can be written in parallel and uploaded in any order.
- Basically, block blobs are for handling large amounts of data over a network.
- Page blob:
- Page blobs are designed for data that needs frequent read/write access.
- Consider a page blob to be like a remote hard disk. For any data that is a work-in-progress, a page blob is the ideal cloud storage.
- High performance and low latency are the key assets of page blobs.
- Append blob:
- An append blob, as its name implies, can only be appended to, and is ideal for log files. A log file is never edited, and just grows and grows! There’s plenty of space in the cloud.
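A minimal sketch of that log-file pattern with an append blob, again via azure-storage-blob and with hypothetical names; the `logs` container is assumed to exist:

```python
from azure.storage.blob import BlobClient

CONN_STR = "DefaultEndpointsProtocol=https;AccountName=<acct>;AccountKey=<key>;EndpointSuffix=core.windows.net"

log_blob = BlobClient.from_connection_string(
    CONN_STR, container_name="logs", blob_name="device-01.log"
)

# Create the append blob once; from then on, blocks can only be added at the end.
if not log_blob.exists():
    log_blob.create_append_blob()

log_blob.append_block(b"2023-11-14T12:00:00Z boot complete\n")
log_blob.append_block(b"2023-11-14T12:00:05Z telemetry loop started\n")
```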
- Data Security:
- Azure Blob storage is automatically encrypted at rest, at no extra cost and with no extra setup.
- The system used is called Storage Service Encryption (SSE).
- Data can be secured in transit, between an app and Azure, using client-side encryption, HTTPS, or SMB 3.0.
Data Lake Storage (mass unstructured data to be backed up):
- Unstructured Data: Data Lake Storage:
- The time to upgrade from Azure Blob storage to Azure Data Lake Storage comes when you have an enormous amount of data.
- To help organize data, a concept called hierarchical namespaces is available in a Data Lake.
- A hierarchical namespace can be used to encapsulate a collection, large or small, of data objects and files.
- It basically adds another level of reference that makes access to the data more efficient (see the sketch after this list).
- Security in Azure Data Lake is applied at the file or folder level, or at finer granularity if needed. All the security, and API access, features of Blob storage apply to Data Lake storage.
- Finally, Data Lake analytics, available through REST APIs, is optimized for big data. Your queries should still run in a decent amount of time, even if they’re trawling through a sea of data.
- Blob storage is your go-to solution for cloud IoT storage. Upgrade to Data Lake Storage if data organization, security, or analytics performance become an issue with your Blob storage.
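A sketch of the hierarchical namespace in use, assuming the azure-storage-file-datalake Python package and a storage account with the namespace enabled; the file system and path names are hypothetical:

```python
from azure.storage.filedatalake import DataLakeServiceClient

CONN_STR = "DefaultEndpointsProtocol=https;AccountName=<acct>;AccountKey=<key>;EndpointSuffix=core.windows.net"

service = DataLakeServiceClient.from_connection_string(CONN_STR)
fs = service.get_file_system_client("telemetry")  # assumed to exist

# Real directories, not just '/' in blob names: a directory can be secured,
# renamed, or deleted as a single unit.
directory = fs.create_directory("raw/device-01/2023/11")

file = directory.create_file("batch-0001.json")
data = b'[{"temperature": 22.5}]'
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
```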
Cosmos DB (Structured Data):
- Structured Data: Cosmos DB:
- An example of structured storage would be a large database, each entry in the database containing similar information, and each entry accessible by a set of similar API calls.
- A Cosmos DB resource is well-structured storage. At the lowest level, a Cosmos DB database consists of JSON objects.
- Access to the data in a Cosmos DB resource is made through queries built from API calls. Cosmos DB supports a range of APIs, including the SQL API, MongoDB API, Gremlin (graph) API, Azure Table API, and Cassandra API.
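A minimal SQL API sketch using the azure-cosmos Python package; the account endpoint, key, database, and container names are hypothetical:

```python
from azure.cosmos import CosmosClient

# Hypothetical account endpoint and key; substitute your own.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")

container = client.get_database_client("iot").get_container_client("telemetry")

# Each item is a JSON object; queries are SQL expressions over those objects.
items = container.query_items(
    query="SELECT c.ts, c.temperature FROM c WHERE c.deviceId = @id",
    parameters=[{"name": "@id", "value": "device-01"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item)
```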
Data Consistency:
- We can create multiple replicas across regions for Cosmos DB; for example, one region handles writes while the other regions only read the data. Data replication and consistency are handled by Cosmos DB. The following types of data consistency are available:
- Strong Consistency:
- In this case, a read operation fetches the same result every time, regardless of which region the data was written in and which regions are reading it.
- In this case, the system waits for an acknowledgment from all locations that they've received the update, before giving the all clear to make the data readable.
- This process ensures worldwide consistency, but at the cost of all locations having to wait for the slowest to receive the update. This latency may only be seconds, but it exists and should be considered.
- In Strong consistency, every location will get identical data on every read.
- Eventual Consistency:
- In this scenario, each location gets the update when it arrives. This process clearly means some locations might have stale data for a short while, before the local data is updated.
- Note that updates can arrive out of order.
- Bounded Staleness Consistency:
- With Bounded Staleness consistency, you set a time threshold, or a version update count threshold.
- This threshold is each location’s tolerance for stale data. If a location reads data only to find it outside the threshold, the system waits until a value within the threshold is available. For example, if the threshold is set at 20 seconds, then only data that is stale by 20 seconds or less is acceptable.
- Set this threshold to zero, and you have Strong consistency.
- Session Consistency: (default)
- In this scenario, the write location has immediate access to the updated data.
- The read locations get the data in the right order, but there will be a different latency for each read location.
- Consistent Prefix:
- With this setting, all locations receive the updates in the correct order, with no update being skipped over.
- The Session consistency level uses this consistent-prefix behavior for all read locations.
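The default consistency level is configured on the Cosmos DB account, and a client may relax (never strengthen) it; in the Python SDK this appears as a keyword argument on the client, as in this hedged sketch:

```python
from azure.cosmos import CosmosClient

# Ask for Session consistency (the default) on this client's operations;
# the endpoint and key are hypothetical placeholders.
client = CosmosClient(
    "https://<account>.documents.azure.com:443/",
    credential="<key>",
    consistency_level="Session",
)
```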
Use cases for Cosmos DB:
- A Cosmos DB resource is usually a more expensive option than Blob storage.
- Create a Cosmos DB resource when you have a mass of well-structured, time-critical data.
- The case for a Cosmos DB is stronger still, if the data needs to be available in several locations across the globe.
Time Series Insights Service:
- A key feature of telemetry is that it’s time-stamped: time and order are critical elements of the data.
- The built-in Azure Time Series Insights service is designed for time-based analytics.
- This service both enables you to visualize your data without writing any code, and enables you to perform expression-based analytics.
- The routing, visualization, and analytics, can all be done via the Azure portal. For simpler analytics, the UI can be used to create the expressions. For deeper analysis, and for integration with programming languages, there’s a REST-based API.
- Check out the out-of-the-box abilities and features of Time Series Insights, before engaging in more costly options.
- Azure Time Series Insights is a managed service that enables you to store, visualize, and query a large amount of time series data.
- Features:
- Parses JSON from messages and structures it into clean rows and columns.
- Joins metadata with IoT device-generated data.
- Manages event data storage with a column-store database and both warm and cold storage.
- Query your data directly in the Time Series Insights Explorer.
- Use APIs to embed time series data into custom applications (see the sketch below).
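As a hedged sketch of that API route, the request below assumes a Time Series Insights Gen2 environment FQDN and an Azure AD bearer token acquired elsewhere; the payload follows the Gen2 Query API's getEvents form:

```python
import requests

# Hypothetical environment FQDN and token; substitute your own.
ENV_FQDN = "<environment-id>.env.timeseries.azure.com"
TOKEN = "<azure-ad-bearer-token>"

payload = {
    "getEvents": {
        "timeSeriesId": ["device-01"],
        "searchSpan": {
            "from": "2023-11-14T00:00:00Z",
            "to": "2023-11-14T01:00:00Z",
        },
    }
}

resp = requests.post(
    f"https://{ENV_FQDN}/timeseries/query?api-version=2020-07-31",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```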