To query streams to retrieve data:

Method : POST

URI : /db/<dbname>/query


{"sql":"sql query..."}

This is the same query API described earlier [ please see API # 7 ].

However, the SQL query has some stream-specific changes. Some examples follow.

  1. To get a few rows from a stream (there are different kinds of streams: input, filter, join, entity), the SQL query structure looks like the following. Note that the where condition, limit, etc. are optional.

    select * from <schema_name>.<stream_name> where <conditions> <limit> <limit_number>

  2. To get a few rows from running statistics/aggregated streams for a particular attribute, the SQL query structure looks like:

    select aggr(attribute_name) from <schema_name>.<stream_name> where <conditions> <limit> <limit_number>

  3. To get a few rows from running statistics/aggregated streams for groupby attributes, the SQL query structure looks like:

    select aggr(attribute_name) from <schema_name>.<stream_name> where <conditions> groupby <limit> <limit_number>
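The three query shapes above can be assembled programmatically. The following is a minimal Python sketch, not part of the BangDB client: the helper names `stream_query`, `aggr_query` and `payload` are hypothetical, and the clause order when `groupby` and `rollup` are combined is an assumption, since the examples in this section only show them separately. Serializing with `json.dumps` takes care of escaping any double quotes inside the SQL text.

```python
import json


def stream_query(schema, stream, where=None, limit=None):
    """Build a row-level query: select * from <schema>.<stream> ..."""
    sql = f"select * from {schema}.{stream}"
    if where:
        sql += f" where {where}"
    if limit is not None:
        sql += f" limit {limit}"
    return sql


def aggr_query(schema, stream, attribute, where=None, groupby=None, rollup=None):
    """Build an aggregate query: select aggr(<attribute>) from ..."""
    sql = f"select aggr({attribute}) from {schema}.{stream}"
    if where:
        sql += f" where {where}"
    if groupby:
        # groupby attributes are joined with ':' as in "groupby catid:pgid"
        sql += " groupby " + ":".join(groupby)
    if rollup is not None:
        # Assumption: rollup goes last when combined with groupby.
        sql += f" rollup {rollup}"
    return sql


def payload(sql):
    """Wrap the SQL text in the {"sql": "..."} request body."""
    return json.dumps({"sql": sql})


if __name__ == "__main__":
    print(payload(stream_query("website", "visitor", limit=2)))
    print(payload(stream_query("website", "visitor",
                               where='vid = "v2" and pgid = "pg1"')))
    print(payload(aggr_query("website", "visitor", "vid",
                             groupby=["catid", "pgid"])))
```

The printed strings can be passed as the body of the POST request to /db/<dbname>/query.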

For example, let's register another schema which does a little more than the previous one. Below is the schema:


Explanation of the schema:

  1. We have a website and we wish to capture a few data points for analysis.
  2. We capture vid (visitor id), pgid (id of the page the user is on), prod (product id), catid (category of the page/product), price (total cost) and items (number of items).
  3. We wish to compute running statistics such as unique visitor count, category count, etc. (see the "stat" attribute).
  4. We further wish to compute running groupby aggregations, for example the unique count of visitors grouped by catid (category) and pgid (page).
  5. We also wish to predict the total sales with “catr” (computed attribute) using “sales_model”, which is trained using the SVM algorithm on a set of attributes/fields (vid, pgid, catid, items).
  6. It's very common to query for values such as total items sold or total sales since the beginning. Although common, these are compute-intensive jobs that take so long that we end up running them once a day or so. Within BangDB we can use “enty” (entity), which keeps such values always ready. Moreover, we can compute running statistics on these too. Here we wish to keep several such entities, such as total number of views so far, total sales so far, etc.
  7. We also have a graph triple defined in “rels” so that, as data is inserted into the stream, the graph is updated as well.

Let's register the schema using the API as defined above [ POST /stream ]

Now, let's insert some events into the stream using the API.

curl -H "Content-Type: application/json" -d'{"vid":"v1","prod":"p1","catid":"c1","pgid":"pg1","price":123.45,"items":3}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v2","prod":"p1","catid":"c1","pgid":"pg1","price":43.27,"items":2}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v3","prod":"p2","catid":"c1","pgid":"pg2","price":67.98,"items":2}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v3","prod":"p1","catid":"c1","pgid":"pg1","price":27.98,"items":1}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v3","prod":"p3","catid":"c2","pgid":"pg3","price":71.65,"items":2}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v2","prod":"p3","catid":"c2","pgid":"pg3","price":41.65,"items":1}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v1","prod":"p3","catid":"c2","pgid":"pg3","price":42.65,"items":1}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v1","prod":"p2","catid":"c1","pgid":"pg2","price":47.05,"items":1}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v1","prod":"p1","catid":"c1","pgid":"pg1","price":54.75,"items":2}' -X POST

curl -H "Content-Type: application/json" -d'{"vid":"v1","prod":"p2","catid":"c2","pgid":"pg2","price":51.50,"items":1}' -X POST

Now we have inserted 10 events for visitors v1, v2 and v3. Let's now run some queries.

To get a few rows from a stream

There are different kinds of streams: input, filter, join, entity.

  1. Select all the events from the stream:

    curl -H "Content-Type: application/json" -d'{"sql":"select * from website.visitor"}' -X POST

  2. Count the number of rows in the stream:

    curl -H "Content-Type: application/json" -d'{"sql":"select count(*) from website.visitor"}' -X POST

       "retval": 10

    The total count is 10.

  3. Select only 2 events:

    curl -H "Content-Type: application/json" -d'{"sql":"select * from website.visitor limit 2"}' -X POST

  4. Select events where the visitor is "v2" and the page id is "pg1" (the inner double quotes are escaped so that the request body stays valid JSON):

    curl -H "Content-Type: application/json" -d'{"sql":"select * from website.visitor where vid = \"v2\" and pgid = \"pg1\""}' -X POST

  5. Select events where items are 3 or more:

    curl -H "Content-Type: application/json" -d'{"sql":"select * from website.visitor where items >= 3"}' -X POST

    And so on...

  6. Select data from the entity stream:

    curl -H "Content-Type: application/json" -d' {"sql":"select * from website.prod_details"}' -X POST

    As you can see, for each product (prod) we get the values of the various entities since the beginning (count or running stats).

    "total_items" = 3 // for p3
    "uvisit" (unique visit) = 3 // for p3
    "sales" (running stats for sales) = {"cnt":3,"sum":155.95,"min":41.65,"max":71.65000000000001,"avg":51.98333333333334,"stdd":17.03917055884273,"skew":1.725341767699333,"kurt":0}

  7. Count the total items / rows in the entity prod_details:

    curl -H "Content-Type: application/json" -d' {"sql":"select count(*) from website.prod_details"}' -X POST

       "retval": 3

    The count is 3.

Now, let's select some aggregated data. We have running statistics set on various attributes like 'vid', 'catid', 'price', 'items', etc. (wherever “stat” is set).
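As a sanity check on the prod_details values reported for p3 earlier, the same numbers can be recomputed locally from the three inserted events whose prod is "p3". This is only an illustrative Python sketch using the standard library, not how BangDB computes entities. The reported "stdd" appears to match the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# (vid, price) for the three inserted events whose "prod" is "p3"
p3_events = [("v3", 71.65), ("v2", 41.65), ("v1", 42.65)]

prices = [price for _, price in p3_events]

# Running statistics over the sale prices, mirroring the "sales" entity fields
sales = {
    "cnt": len(prices),
    "sum": sum(prices),
    "min": min(prices),
    "max": max(prices),
    "avg": statistics.mean(prices),
    "stdd": statistics.stdev(prices),  # sample standard deviation (n - 1)
}

# Unique visitors who generated an event for p3, mirroring "uvisit"
uvisit = len({vid for vid, _ in p3_events})

if __name__ == "__main__":
    print("uvisit:", uvisit)
    print("sales:", sales)
```

Small differences in the last decimal places are expected from floating-point arithmetic.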

To get a few rows from running statistics/aggregated streams for a particular attribute

curl -H "Content-Type: application/json" -d'{"sql":"select aggr(vid) from website.visitor where st >= 1 and et <= 2648490388199000"}' -X POST

Here "st" and "et" are the start time and end time in microseconds.
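A small Python sketch for producing such time bounds is shown here. It assumes that "st" and "et" are counted from the Unix epoch in UTC, which the text above does not state explicitly; `to_micros` and `time_window` are hypothetical helper names. Integer timedelta division is used instead of multiplying a float timestamp, so no precision is lost.

```python
from datetime import datetime, timedelta, timezone

# Assumption: st/et are microseconds since the Unix epoch (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Return whole microseconds since the epoch for a timezone-aware datetime."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def time_window(start, end):
    """Build the where-clause fragment used in the aggregate queries."""
    return f"st >= {to_micros(start)} and et <= {to_micros(end)}"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    end = start + timedelta(hours=1)
    print(time_window(start, end))
```

The datetimes must be timezone-aware; subtracting the aware `EPOCH` from a naive datetime raises a `TypeError`.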


As we can see, there is one row for every minute, because the running statistics are computed with 60 sec granularity. We can roll these up over as many minutes as required. For example, let's roll up completely, since the beginning.

curl -H "Content-Type: application/json" -d'{"sql":"select aggr(vid) from website.visitor where st >= 1 and et <= 2648490388199000 rollup 1"}' -X POST



This tells us that there are 3 unique vid values (visitors), which is correct since we are doing UCOUNT on vid.

We can now roll up every 5 min by using "rollup 5" (the lowest granularity is one minute, so 5 times that gives 5 minutes).

curl -H "Content-Type: application/json" -d'{"sql":"select aggr(vid) from website.visitor where st >= 1 and et <= 2648490388199000 rollup 5"}' -X POST



As you can see, all 3 unique visitors fall in the first 5 min, since we inserted all the data at once.

To get a few rows from running statistics/aggregated streams for groupby attributes

curl -H "Content-Type: application/json" -d' {"sql":"select aggr(vid) from website.visitor groupby catid:pgid"}' -X POST



Here you get COUNT for each catid and pgid group.

We can add a filter here, for example querying only the group with catid c1 and pgid pg1.

curl -H "Content-Type: application/json" -d' {"sql":"select aggr(vid) from website.visitor where skey = \"c1:pg1\" groupby catid:pgid"}' -X POST
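To see what the groups and the "skey" filter refer to, the grouping can be reproduced locally from the ten inserted events. This Python sketch is illustrative only. It reports both the plain event count and the unique visitor count per group, since whether the server returns COUNT or UCOUNT depends on how the schema defines the groupby aggregation.

```python
from collections import defaultdict

# (vid, catid, pgid) for the ten events inserted earlier, in insertion order
events = [
    ("v1", "c1", "pg1"), ("v2", "c1", "pg1"), ("v3", "c1", "pg2"),
    ("v3", "c1", "pg1"), ("v3", "c2", "pg3"), ("v2", "c2", "pg3"),
    ("v1", "c2", "pg3"), ("v1", "c1", "pg2"), ("v1", "c1", "pg1"),
    ("v1", "c2", "pg2"),
]

# Group visitor ids under a "catid:pgid" key, the same form as skey = "c1:pg1"
groups = defaultdict(list)
for vid, catid, pgid in events:
    groups[f"{catid}:{pgid}"].append(vid)

# Per-group event count and unique visitor count
summary = {
    skey: {"count": len(vids), "unique": len(set(vids))}
    for skey, vids in groups.items()
}

if __name__ == "__main__":
    for skey, stats in sorted(summary.items()):
        print(skey, stats)
```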



We can also use rollup here, as in the previous section. In the schema, the granularity for groupby aggregation is 300 sec (5 min), so each row covers 5 min. With rollup 1 we get aggregated values for each group since the beginning (one row per group). With rollup 5 we get one row for every 5*5 = 25 min [ since the granularity is 5 min ].
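The rollup arithmetic described in this section can be summarized in a few lines of Python. This is a sketch of the rule as stated in the text, not of the server's implementation, and `rollup_window_seconds` is a hypothetical name: "rollup 1" is treated as rolling up the whole queried range, and any larger value multiplies the stream's granularity.

```python
def rollup_window_seconds(granularity_sec, rollup):
    """Return the width of each result row in seconds.

    None means a single row covering the whole queried range ("rollup 1").
    """
    if rollup < 1:
        raise ValueError("rollup must be 1 or greater")
    if rollup == 1:
        return None  # complete rollup since the beginning
    return granularity_sec * rollup


if __name__ == "__main__":
    # Running statistics: 60 sec granularity, rollup 5 -> 300 sec (5 min)
    print(rollup_window_seconds(60, 5))
    # Groupby aggregation: 300 sec granularity, rollup 5 -> 1500 sec (25 min)
    print(rollup_window_seconds(300, 5))
    # rollup 1 -> None (one row for the whole range)
    print(rollup_window_seconds(300, 1))
```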