Introduction

This benchmark is specific to the use case of Streamoid's product Catalogix running on MongoDB. Catalogix is a platform for managing ecommerce product catalogs across stores, marketplaces, etc. for the world's leading retailers, providing them with AI automations built for retail.

As part of the POC, BangDB has been deployed to bring together various elements for more efficient storage and processing of data. The major goals of the POC are to assess the improvement in performance vis-à-vis the current implementation and the reduction in operational overhead (both development and management). To achieve this, BangDB implements a schema which defines the logic for ingestion of events/data, processing of various elements, some running stats, ETL, etc. as part of the stream processing. The schema also defines the logic for auto-updating the underlying graph to maintain subject-predicate-object (sub-pred-obj) triples, which lets queries leverage the natural relationships in the data and avoid run-time joins. There are several other aspects which we have not yet implemented and integrated, namely ML on streams, pattern/anomaly detection, actions, etc., which could be augmented later as required.
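To make the sub-pred-obj idea concrete, below is a minimal sketch of how an ingested Catalogix event could map to graph triples. The event shape, field names, and predicate names are hypothetical illustrations, not BangDB's actual schema syntax:

    # Illustrative only: an ingested product event (hypothetical shape)
    event = {"store_id": "store_42", "product_id": "prod_9001", "smp_parent": "grp_77"}

    # Each relationship in the event becomes a subject-predicate-object triple,
    # so later queries can walk the graph instead of doing run-time joins.
    triples = [
        ("store:" + event["store_id"], "HAS_PRODUCT", "product:" + event["product_id"]),
        ("product:" + event["product_id"], "BELONGS_TO", "group:" + event["smp_parent"]),
    ]
    print(triples)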

Further, to evaluate this in an interactive manner, a test portal has been created on Ampere mimicking the UI flow of Catalogix (from a query perspective, not look-and-feel).

Setup

Data

    Streamoid: around 25 GB database on Prod on MongoDB, so we tried to use about 3X of this size overall on BangDB

    BangDB:
    1. Around 77 GB for Graph and Stream (overall size 137 GB)
    2. Extra 60 GB for documents, BRS, logs, stats, entities, etc.
    3. 1M max products in any store
    4. 1,900 stores
    5. 650K SMP_Parent, images, status
    6. 9.3M marketplace records
    7. Around 3X+ data compared to the Streamoid production database

Machine

    Cloud: GCP
    VM: c2-standard-16
    RAM: 64GB
    CPU: 8 cores, 16 vCPU
    Disk: 500GB SSD Persistent disk
    OS: Ubuntu 18.04 LTS
    Arch: x86/64, AMD

Client

    All tests were done using REST APIs. The C++ and Java clients could be used for far better performance, but since Catalogix works over REST APIs we used the same.

    Protocol: HTTP 1.1
    Methods: HEAD, GET, POST, PUT, DELETE

    Supported Headers

    • Content-Type: application/json, Content-Type: text/plain
    • Connection: keep-alive
    • Access-Control-Allow-Origin: *
    • Access-Control-Allow-Methods: *
    • Access-Control-Allow-Headers: *
    • Vary: *
    • Access-Control-Max-Age: 3600
    • x-bang-api-key: <api_key>

    Binary data or objects should be passed base64-encoded using Content-Type: text/plain, as in the sketch below.
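    A minimal sketch of such a call using Python's requests library follows; the endpoint path, host/port, and API key are placeholders, not BangDB's actual REST routes:

        import base64
        import requests

        API_URL = "http://<host>:<port>/insert"   # placeholder endpoint
        HEADERS = {
            "Content-Type": "application/json",
            "Connection": "keep-alive",
            "x-bang-api-key": "<api_key>",
        }

        # JSON payload over an HTTP 1.1 POST
        resp = requests.post(API_URL, json={"product_id": "prod_9001"}, headers=HEADERS)
        print(resp.status_code, resp.text)

        # Binary objects go base64-encoded with Content-Type: text/plain
        text_headers = {**HEADERS, "Content-Type": "text/plain"}
        payload = base64.b64encode(b"\x00\x01-binary-object").decode("ascii")
        resp = requests.post(API_URL, data=payload, headers=text_headers)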

    There is a separate document, attached with this one, which describes the queries and API details.

Below are several sections detailing the different runs of the performance benchmark.

Queries [Read Operations]

The queries used are the ones run on Catalogix all the time for various interactions. To compare the performance numbers for these queries between Catalogix and BangDB, the following datasets were used.

    Catalogix: 200 stores with 134K products
    BangDB: 200 stores with 134K products
    BangDB: POC store with 1M products

The queries and their execution times in milliseconds (excluding network time) are as follows:

Num | Query | MongoDB 134K Products | BangDB 134K Products | BangDB 1M Products
1 | Get variation, style and total products for a store | 275 | 316 | 318
2 | Get 50 products for a given store - UI first query | 15000 | 402 | 601
3 | Get custom and marketplace data for a given product | 500 | 309 | 347
4 | List product for a store in ascending modified date | 17000 | 976 | 1300
5 | Fetch select attributes for product of a store | 15000 | 238 | 277
[Chart: analysis of the above query timings]
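For reference, below is a minimal client-side timing harness (the endpoint and query body are placeholders). Note that it measures wall-clock time including the network, whereas the numbers above exclude network time, so server-reported execution times should be preferred where available:

    import time
    import requests

    QUERY_URL = "http://<host>:<port>/query"   # placeholder endpoint
    HEADERS = {"Content-Type": "application/json", "x-bang-api-key": "<api_key>"}

    def time_query(payload, runs=5):
        # Average wall-clock milliseconds over several runs
        samples = []
        for _ in range(runs):
            t0 = time.perf_counter()
            requests.post(QUERY_URL, json=payload, headers=HEADERS)
            samples.append((time.perf_counter() - t0) * 1000.0)
        return sum(samples) / len(samples)

    print(time_query({"query": "<query text>"}))   # placeholder query body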

Since only these queries were being used at Catalogix, we ran more queries on BangDB; below are their performance numbers (time in milliseconds).

Num | Query | 134K Products | 1M Products
query 1 | Get variation, style and total products for a store | 316 | 318
query 2 | Get 50 products for a given store - UI first query | 402 | 601
query 3 | Get custom and marketplace data for a given product | 309 | 347
query 4 | List product for a store in ascending modified date | 976 | 1360
query 5 | Fetch all attributes for product of a store | 238 | 277
query 6 | Fetch select attributes for product of a store | 489 | 1200
query 7 | Fetch data in rel with conditions | 390 | 1090
query 8 | Fetch product groups (SMP_Parent) for a store | 410 | 987
query 9 | Count total product_groups for a store | 511 | 1360
query 10 | Chain query to get store, product and SMP data | 560 | 996
query 11 | Required fields and SMP fields | 334 | 365
query 12 | Query for total stores count | 217 | 287
query 13 | Fetch all data for a store node (can have config file and other attribute level details as node properties) | 284 | 517
query 14 | Fetch details for a particular store | 316 | 330
query 15 | Count for related stores | 309 | 347
query 16 | Fetch shared stores SMP data | 298 | 770
query 17 | Get total num of products | 276 | 668
query 18 | Fetching 50 random products | 430 | 495
query 19 | Fetch all product related to a group | 1190 | 1360
query 20 | Filter condition on chain query - returns selected fields | 674 | 1010
query 21 | Fetch selected data for all images related to a store | 329 | 659
query 22 | Get all images nodes | 456 | 1060
query 23 | Fetch status and product_group (SMP_parent) details for a store | 324 | 560
query 24 | Fetch product details and status details | 453 | 760
query 25 | Fetch given marketplace data for a store | 265 | 364
[Chart: analysis of the above query timings]

The above were the numbers for read operations; in the next section we go into write performance.

Write Operations

This section covers the write operations and their throughput and performance numbers during data ingestion, inserts, and updates. Following are the details of the test.

The benchmark test for writes was done for Product data, Marketplace data, and Status data. The test was run with the following details; a client-side concurrency sketch follows the list.

    Num of parallel simultaneous connections: 8 [DB used 8 threads]
    Method: POST over REST
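A minimal sketch of how such a load could be driven from the client side, assuming Python's requests library with 8 parallel connections; the endpoint, API key, file layout, and record shape are placeholders:

    import json
    import requests
    from concurrent.futures import ThreadPoolExecutor

    API_URL = "http://<host>:<port>/insert"   # placeholder endpoint
    HEADERS = {"Content-Type": "application/json",
               "Connection": "keep-alive",
               "x-bang-api-key": "<api_key>"}

    # One shared session with a pool of 8 keep-alive connections
    session = requests.Session()
    session.mount("http://", requests.adapters.HTTPAdapter(pool_connections=8,
                                                           pool_maxsize=8))

    def post_record(record):
        return session.post(API_URL, json=record, headers=HEADERS).status_code

    with open("product_details_1_1.json") as f:
        records = json.load(f)   # assumes the file holds a JSON array of records

    # 8 parallel POST streams, matching the 8 server-side threads
    with ThreadPoolExecutor(max_workers=8) as pool:
        statuses = list(pool.map(post_record, records))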

We selected these datasets for various reasons. Product data is write-heavy: inserting just one record involves 13 write operations and 9 read operations (22 IO operations per record in total). Marketplace data is moderate: it requires 13 write operations and no reads at all. Finally, status data is lightweight, since inserting one status record involves only 2 write operations and no reads.

The data was split into multiple parts, and the summary of each part is as mentioned in the appendix.

Product data


1. product_details_1_1.json - 224.6MB
2. product_details_1_2.json - 224 MB
3. product_details_1_3.json - 111.8 MB
4. product_details_1_4.json - 111.8 MB
5. product_details_1_5.json - 111.9 MB
6. product_details_1_6.json - 111.9 MB
7. product_details_1_7.json - 111.9 MB
8. product_details_1_8.json - 111.9 MB

Marketplace data


1. amazon_1.json - 219.9 MB
2. marketplace_1.json - 188.5 MB
3. marketplace_1_3.json - 181.8 MB
4. marketplace_1_4.json - 36.4 MB
5. marketplace_1_5.json - 92.6 MB
6. marketplace_1_6.json - 92.6 MB
7. marketplace_1_7.json - 181.8 MB
8. marketplace_1_8.json - 36.3 MB

Status data


1. status_details_1.json - 10.1 MB
2. status_details_1_2.json - 10.1 MB
3. status_details_1_3.json - 6 MB
4. status_details_1_4.json - 9.4 MB
5. status_details_1_5.json - 6 MB
6. status_details_1_6.json - 4 MB
7. status_details_1_7.json - 6 MB
8. status_details_1_8.json - 4 MB

Following is a summary of the test.

Data | Num of rows | Size (MB) | Time (sec) | IO ops per record | Events/sec | IOPS | MB/sec
Product | 1,000,000 | 1068 | 728 | 22 | 1374 | 30220 | 1.47
MarketPlace | 760,000 | 982 | 172 | 13 | 4419 | 57442 | 5.71
Status | 260,000 | 54 | 11 | 2 | 23636 | 47273 | 4.91
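The derived columns follow directly from the raw counts; as a quick sanity check, here is the arithmetic for the Product row (values taken from the table above):

    rows, size_mb, secs, io_per_record = 1_000_000, 1068, 728, 22

    events_per_sec = rows / secs            # ~1374 events/sec
    iops = events_per_sec * io_per_record   # ~30,220 IO operations/sec
    mb_per_sec = size_mb / secs             # ~1.47 MB/sec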

Below are the charts depicting the behaviour of the operations from beginning to end.

Product Data

[Charts: Product data ingestion behaviour over the run]

MarketPlace data

[Charts: MarketPlace data ingestion behaviour over the run]

Status data

[Charts: Status data ingestion behaviour over the run]

Summary

  • From the Catalogix use case perspective, BangDB eliminates the low-performance bottleneck, as it is on average ~10-15X faster for the slower queries.
  • BangDB also eliminates the middle app layer where various computations happened; that layer was a reason for the low performance, since the app was doing the DB's job.
  • For read queries, BangDB performs and scales well even at 3X the production data size.
  • BangDB allows data to grow beyond memory and still works efficiently, which should let users maintain and scale gracefully.
  • Write operations, mixed with reads, scale well. BangDB provides high IOPS for concurrent, continuous writes of different kinds of data, so users can perform reads and writes in a random and mixed manner with high performance.
  • Catalogix integration should be simple, as there are only 5 queries for which changes are needed at the FE level.
  • There are more features within BangDB that could be utilized in the coming days; verification, ETL, filters, running joins, CEP, AutoML, etc. are a few examples.
  • There is also scope for quite a few improvements in the implementation, which we can optimize later.