Stream Manager - Before we describe streams in detail, how to define them and how to process their data, let's look at the APIs required for most of the work, starting with the concept of the BangDB Stream Manager.
The Stream Manager works with streams, where each stream is defined as a set of attributes that together form a stream event, along with the operations that can be performed on these events. Let's take a very simple case where a few attributes arrive from different streams, for example temperature and pressure data streams in an IoT scenario. We can then define the streams as follows:
{
  "schema": "myschema",
  "streams": [
    {
      "name": "temp_stream",
      "type": 1,
      "swsz": 81600,
      "inpt": [],
      "attr": [
        {
          "name": "temp",
          "type": 11,
          "stat": 3
        },
        {
          "name": "point",
          "type": 9
        },
        {
          "name": "sensor_name",
          "type": 5,
          "kysz": 18,
          "sidx": 1,
          "stat": 2
        }
      ]
    },
    {
      "name": "pressure_stream",
      "type": 1,
      "inpt": [],
      "attr": [
        {
          "name": "pressure",
          "type": 11,
          "stat": 3
        },
        {
          "name": "point",
          "type": 9
        },
        {
          "name": "sensor_name",
          "type": 5,
          "kysz": 18,
          "sidx": 1,
          "stat": 2
        }
      ]
    }
  ]
}
This is the basic structure for defining streams. Let's discuss it in more detail.
Schema
Any real use case involves more than a single stream, and we also need to define the operations to be performed on these streams. Hence we need a wrapper to hold all the streams and the operations on their data.
We use "schema" as that container for the streams and the operations on them. It also allows us to isolate different schemas within the system. You can think of a schema as a way to segregate different solutions, apps, or users within the system, in effect a namespace that keeps the different structures from interfering with each other.
Stream
A stream is a collection of attributes for a particular data source, for example temperature sensor readings, payment transaction events, telecom CDR data, pizza order-delivery data, etc. A stream is defined, in simple terms, by the following fields:
"name" : stream name, give a name of the stream. DB does name mangling using schema name, hence user doesn't have to bother about it as long as the stream name is unique within schema.
"type" : type of the stream. Even though we ingest data in a stream, due to various processing of data, we would end up creating many other streams as well. There are following types of streams here, denoted by number.
type = 1
means a normal, direct or raw stream, the one into which we ingest data. This stream gets its data from outside, through agents or any other means; the application simply sends events to this stream (see the sample event after this list of types).
type = 2
means a filter stream, which receives data based on a filter defined on the normal (or another) stream. Once data is ingested into the raw stream, we may filter it based on some condition and send the matching events to this filter stream.
type = 3
means a joined stream, which receives data produced by joining two or more streams. Any two streams can be joined on some condition and the output sent to this joined stream.
type = 4
means an entity stream, which collects data from various streams as long-term profile data for an entity, kept for a long period of time (or forever). More on this later.
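To make the raw stream (type 1) concrete, the event below is a minimal sketch of what an application might send to the temp_stream defined above. The field values are purely illustrative, and the point attribute is assumed here to carry an epoch-seconds timestamp (an assumption, not something the schema mandates):
{
  "temp": 27.8,
  "point": 1689659000,
  "sensor_name": "sensor_12"
}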
"swsz" : Size of sliding window. Each of these streams could also reside in a sliding window. This is not tumbling window but a continuous window, which is more appropriate for stream analytics. Tumbling window is restriction and BangDB doesn't deal with it. The swsz is number of seconds for the sliding window.
We can have as low as 1 sec to as large as many years. However, it's important to set this properly and not go below a day or hour. For use cases where we wish to analyse in 1 sec or 5 sec or 60 sec or min or hour, we can do that using the processing definition which we will discuss later. This swsz size is for the raw stream data and how long would user like to persist it.
Once it slides, data could be archived or simply discarded or sent to other integrated system. We will have more discussion on this later. Default value for the size is 86400 i.e. one day.
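For instance, to keep roughly a week of raw data, the window would be 7 × 86400 = 604800 seconds. A minimal sketch of the stream header (with inpt and attr omitted here for brevity) might look like this:
{
  "name": "temp_stream",
  "type": 1,
  "swsz": 604800
}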
"attr" : list of attributes for the event. It also takes other info which dictates how to treat, store the attribute and what aggregations should be done in continuous basis. Following are important metadata that could be associated with the attribute.
"name" : name of the attribute, this will be in the stream of data identifying the attribute.
"type" : type of the attribute. There are following types supported as of now:
5 = string
9 = long
11 = double
"sidx" : Whether we should have secondary index for the attribute or not. 1 for yes and 0(default) for no. This can only be done for string(5) and long(9) types of attribute. This control to the user is good since secondary index will improve query performance but add to resources.
Hence, if query is not going to happen on this then we should ingore it. Note that db in any way if necessary will override and create indexes as required for different scenarios.
"kysz" : key size, this is only for type = 5, string type. It specifies the max size of the key. It's only important when “sidx” is enabled, i.e. secondary index is enabled as it will be used for indexing, hence need to have upper boundary for the size.
"stat" : This will enable the statistics / aggregation for the attribute. There are following options as defined by enum bangdb_stat_type
:
1: count 2: unique count, uses hyperloglog. Should be used for string(5) type attribute 3: running stats. For ex; count, min, max, avg, stddev, covar, skewness, ex_kurtosis 4: two-val stats. For ex; mean_x, mean_y, std_x, std_y, covr etc. As of now it's not enabled in this version. It's coming soon 5: top-k 99: invalid.
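Putting the attribute-level options together, the two attributes below (taken from the schema at the top of this section) show typical choices: a numeric attribute with running stats (stat 3), and a string attribute that is secondary-indexed with a bounded key size and counted uniquely via hyperloglog (stat 2):
[
  {
    "name": "temp",
    "type": 11,
    "stat": 3
  },
  {
    "name": "sensor_name",
    "type": 5,
    "kysz": 18,
    "sidx": 1,
    "stat": 2
  }
]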
Note: for attributes we can leave every setting at its default, but two pieces of information are mandatory, namely "name" and "type". By defining such a schema we are ready to start ingesting data and can perform the processing described so far. But this processing is quite limited, and we would like to do much more.
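For illustration, a minimal schema that relies on defaults everywhere could look like the sketch below; only "name" and "type" are set for each attribute. The stream and attribute names here are hypothetical, and whether fields such as inpt can be omitted entirely depends on the defaults, so treat this as a sketch rather than a verified definition:
{
  "schema": "myschema",
  "streams": [
    {
      "name": "humidity_stream",
      "type": 1,
      "inpt": [],
      "attr": [
        {
          "name": "humidity",
          "type": 11
        },
        {
          "name": "sensor_name",
          "type": 5
        }
      ]
    }
  ]
}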