

Object stores need to be more than dumb pipes

It’s common knowledge that the growth rate of created data is increasing and that there is a distinction between unstructured data, like an image, where the data is treated as a single entity or blob, and structured data, where the data is formatted, such as CSV, JSON, or Parquet. What’s less well-known is that, although there is currently more unstructured data, structured data is growing in volume and importance. One study projected that, by the end of 2025, structured data will grow to 32% of new data captured, created, or replicated. Object storage systems have been very effective at storing and managing unstructured data, but they also need to be targeted at structured data, where APIs can exploit the structure or format of the data.

For example, software-generated transaction and application logs can be large files of 100s of GBs. Once stored, different use cases may only want to retrieve a subset of the data – e.g., a specific date range or only two column fields. Object stores are ideal as data lakes because data as objects can be globally referenced for storage and retrieval, and the amount of stored data can scale to the exabyte range in a fault-tolerant and cost-effective way. As users look to do more analysis and decision-making with the data, however, object stores need to be much more than dumb pipes moving large object blobs back and forth. Instead, users want the advantages of virtually storing all data in one place and also the ability to selectively query that data. Enter the AWS S3 SELECT API, which uses SQL syntax to look inside an object and return a subset of that object’s data.

As a simple example, consider a CSV file of bank transactions (a hypothetical excerpt and query are sketched below). Instead of retrieving the whole file (which may be 100s of GBs) and then processing it on the client side, the application can use the S3 SELECT API to filter to a specific date range and retrieve only the withdrawals greater than a specific value. Later, a user can make an ad hoc query for another subset of the data from that same base object.
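
To make the example concrete, here is a minimal sketch of such a query using the AWS SDK for Python (boto3), which also works against S3-compatible endpoints that implement S3 SELECT. The endpoint, bucket, object key, column names, and the small CSV excerpt in the comment are hypothetical placeholders, not taken from a real deployment.

    # Hypothetical excerpt of the bank-transactions CSV (header row plus data rows):
    #   txn_date,txn_type,amount
    #   2024-01-03,withdrawal,250.00
    #   2024-01-04,deposit,1200.00
    #   2024-01-09,withdrawal,980.00
    import boto3

    # Endpoint, bucket, and key are illustrative placeholders.
    s3 = boto3.client("s3", endpoint_url="https://s3.example.com")

    resp = s3.select_object_content(
        Bucket="bank-logs",
        Key="transactions-2024.csv",
        ExpressionType="SQL",
        # Filter to a date range and return only large withdrawals,
        # instead of downloading the whole multi-GB object.
        Expression="""
            SELECT s.txn_date, s.amount
            FROM S3Object s
            WHERE s.txn_type = 'withdrawal'
              AND s.txn_date BETWEEN '2024-01-01' AND '2024-03-31'
              AND CAST(s.amount AS FLOAT) > 500.0
        """,
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
        OutputSerialization={"CSV": {}},
    )

    # The response is an event stream; Records events carry the matching rows.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"), end="")

Only the matching rows travel over the network; the large base object stays in the object store and can be re-queried later with a different expression.
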
These scenarios show how S3 SELECT enables a basic type of data warehouse functionality. The S3 SELECT API works on multiple types of structured data: CSV, JSON, and Parquet, as well as GZIP- and BZIP2-compressed objects. For AI/ML and analytics use cases, S3 SELECT offers the advantages of reducing network traffic, reducing the compute load of data processing, and reusing the same base object for multiple uses. S3 SELECT is useful in all environments, but it is especially advantageous for edge applications, where fast decision-making is often required and fewer compute and storage resources are available.
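
As a brief illustration of one of the other supported formats, the sketch below runs the same kind of call against GZIP-compressed, newline-delimited JSON; as before, the endpoint, bucket, key, and field names are hypothetical.

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.com")  # illustrative endpoint

    resp = s3.select_object_content(
        Bucket="app-logs",
        Key="events-2024-01.json.gz",   # newline-delimited JSON, GZIP-compressed
        ExpressionType="SQL",
        Expression="SELECT s.ts, s.level, s.msg FROM S3Object s WHERE s.level = 'ERROR'",
        InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
        OutputSerialization={"JSON": {}},
    )

    # Stream out only the matching JSON records.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"), end="")
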
The rest of this article describes Cloudian’s S3-SELECT software, which is Kubernetes-managed and deployable in nearly any environment.

S3 JSON QUERY SOFTWARE

As previously mentioned, Cloudian’s S3-SELECT software is Kubernetes-managed, described by a set of YAML configuration files that deploys two types of Pods: a Master cluster and a Worker cluster. It is used in conjunction with an S3-compatible object store.

The Master Pod communicates directly with the S3 clients and builds task lists that split a SELECT request into multiple object fragments for the Worker processes. Typically, a DNS lookup for a virtual-host-style domain (i.e., a hostname that embeds the bucket name) is done.
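
The task-list mechanism itself is not detailed here, but the public S3 SELECT API gives a feel for how a query can be fanned out over fragments of one object: the ScanRange parameter restricts a SELECT to a byte range, so a coordinator can hand non-overlapping ranges to separate workers. The sketch below is a rough, single-process illustration of that idea, not Cloudian’s implementation; the endpoint, bucket, object, and column positions are hypothetical.

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example.com")  # illustrative endpoint

    BUCKET, KEY = "bank-logs", "transactions-2024.csv"  # hypothetical uncompressed CSV object
    # Columns are referenced by position (_2 = txn_type) because fragments other
    # than the first one do not contain the header row.
    QUERY = "SELECT * FROM S3Object s WHERE s._2 = 'withdrawal'"

    def select_fragment(start, end):
        """Run the SELECT over one byte-range fragment of the object.

        A record is processed by the fragment that contains its first byte,
        so with non-overlapping ranges each row is returned exactly once.
        """
        resp = s3.select_object_content(
            Bucket=BUCKET,
            Key=KEY,
            ExpressionType="SQL",
            Expression=QUERY,
            InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
            OutputSerialization={"CSV": {}},
            ScanRange={"Start": start, "End": end},
        )
        return b"".join(e["Records"]["Payload"] for e in resp["Payload"] if "Records" in e)

    # A master process could build a task list like this and hand one fragment
    # to each worker; here the fragments are simply processed in a loop.
    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    workers = 4
    step = -(-size // workers)  # ceiling division
    tasks = [(i * step, min((i + 1) * step, size) - 1) for i in range(workers)]

    results = [select_fragment(start, end) for start, end in tasks]
    print(b"".join(results).decode("utf-8"), end="")

In the architecture described above, each (start, end) task would go to a Worker process rather than running in a local loop, with the Master merging the partial results before replying to the S3 client.
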
