Recently, we built a MySQL data warehouse based on a message queue.

Most companies that plan to split their databases or tables into many pieces must prepare their systems for queries that span the resulting shards.

Two problems must be solved in this situation:

1. How to get correct results in time
2. How to build a strong data warehouse for future analysis

Here are the policies used by YHD

They have deployed a middleware layer between the web applications and the databases to support these requests. Every aggregation SQL statement is split into many small SQL statements that run on every data node, and the final result is the aggregation of all these small results. In this procedure, everything is computed in memory to get high performance.
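
Here is a minimal sketch of that split-and-merge idea in Java over plain JDBC. The shard URLs, the `orders` table, and its `amount` column are hypothetical; a real middleware would also rewrite non-distributive aggregates (for example, AVG must be split into SUM and COUNT and recombined at merge time):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SplitMergeAggregation {
    // Hypothetical shard URLs; real middleware reads these from its routing table.
    private static final String[] SHARDS = {
        "jdbc:mysql://shard1:3306/orders_db?user=app&password=secret",
        "jdbc:mysql://shard2:3306/orders_db?user=app&password=secret"
    };

    public static void main(String[] args) throws Exception {
        // The global query "SELECT SUM(amount) FROM orders" is split into
        // one identical small query per data node.
        String shardSql = "SELECT SUM(amount) FROM orders";
        long total = 0;
        for (String url : SHARDS) {
            try (Connection conn = DriverManager.getConnection(url);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(shardSql)) {
                if (rs.next()) {
                    total += rs.getLong(1); // merge step: sum the partial sums in memory
                }
            }
        }
        System.out.println("merged SUM(amount) = " + total);
    }
}
```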

In the data warehouse layer, they use self-built ETL tools to extract data from the different databases into an Oracle Exadata platform. Log-based data is put into Hadoop and HBase.

I found a new solution

With Canal and Roma (see the previous PDF about the Roma system), we could build a data warehouse based on MetaQ (MetaQ is the final storage of Roma), so we can run some simple queries against this data warehouse directly.
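
As a rough illustration of the ingestion side, here is a sketch using Canal's open-source Java client to pull MySQL binlog row changes. The canal server address and destination name are assumptions from a typical deployment, and `forwardToQueue` is a hypothetical placeholder for publishing into Roma/MetaQ:

```java
import java.net.InetSocketAddress;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;

public class BinlogToQueue {
    public static void main(String[] args) throws Exception {
        // Connect to a canal server instance named "example" (assumed deployment).
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe(".*\\..*"); // all schemas and tables

        while (true) {
            Message message = connector.getWithoutAck(100); // a batch of binlog entries
            long batchId = message.getId();
            if (batchId == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(1000); // no new binlog entries yet
            } else {
                for (CanalEntry.Entry entry : message.getEntries()) {
                    if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) continue;
                    CanalEntry.RowChange rowChange =
                            CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                    forwardToQueue(entry.getHeader().getTableName(), rowChange);
                }
            }
            connector.ack(batchId); // confirm the batch once it is safely queued
        }
    }

    // Hypothetical placeholder: in the real pipeline this publishes to Roma/MetaQ.
    private static void forwardToQueue(String table, CanalEntry.RowChange change) {
        System.out.println(table + " -> " + change.getEventType());
    }
}
```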

We could use MySQL to build this data warehouse and use the databases' native replication (everything is simple, especially with the multi-source feature in MariaDB).
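
A minimal sketch of the MariaDB multi-source setup, issued from Java over JDBC. The warehouse URL, shard hosts, credentials, and binlog coordinates are all placeholders; the key point is that each source master gets its own named replication connection:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MultiSourceSetup {
    public static void main(String[] args) throws Exception {
        // Connect to the MariaDB warehouse node (placeholder URL and credentials).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mariadb://warehouse:3306/?user=root&password=secret");
             Statement st = conn.createStatement()) {
            // MariaDB multi-source replication: one named connection per master.
            st.execute("CHANGE MASTER 'shard1' TO MASTER_HOST='shard1', " +
                       "MASTER_USER='repl', MASTER_PASSWORD='replpass', " +
                       "MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4");
            st.execute("CHANGE MASTER 'shard2' TO MASTER_HOST='shard2', " +
                       "MASTER_USER='repl', MASTER_PASSWORD='replpass', " +
                       "MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4");
            st.execute("START SLAVE 'shard1'");
            st.execute("START SLAVE 'shard2'");
        }
    }
}
```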

Disadvantages of this architecture: a MySQL database is not the best choice for a data warehouse, so we still need another analytics platform to handle the remaining log-based data.

Most BI systems are built on very expensive commercial software. For small and medium-sized companies, this architecture can save a lot of cost.

The client that aggregates messages:

(figure: etl_roma)

Split-and-merge aggregation:

(figures: split1, split2)

Overall architecture:

(figure: DW_arch)