Sometimes the observed problem resulted from a bug in the data transformation logic in either the batch or real-time sides.
Sometimes the source data in the log files differed from the emitted events, and the discrepancy had to be fixed in the feature platform or accounted for in the transformation logic.
When it came to reporting, we knew we were being thorough, but how could we ensure our reports were up to date on demand?
There came a point when we realized that we could never meet the need for real-time reports with the ETL-driven approach described above. So what we had really done was add a second data integration pipeline to our environment.
Several problems started to become apparent as we lived with this dual-pipeline scenario.
Reports hitting these database tables now show real-time data.

In the diagram above, generated events are sent to queues within Kafka on the left-hand side. Then, on the right side, Storm takes events from those queues and processes them. There are many high-profile users of this approach, including Twitter, where Storm was developed.

In this approach you consolidate all of the source data into a single stream. For us, this means moving entirely to event-based data.
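To make the single-stream idea concrete, here is a minimal Python sketch. It is an illustration only, using plain in-memory structures rather than the real Kafka or Storm APIs; the event fields and the counting logic are hypothetical. The point is the shape of the design: every event flows through one queue, and a consumer updates a running aggregate as each event arrives, the way a Storm bolt would, so reports against that aggregate are always current.

```python
from collections import deque, defaultdict

# Hypothetical stand-in for a single Kafka topic: one queue of events.
# In the consolidated design, ALL source data arrives here as events.
events = deque([
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "view"},
    {"user": "alice", "action": "view"},
])

# A bolt-like consumer: pull each event off the stream and update a
# running aggregate incrementally, instead of recomputing it in batch.
counts_by_user = defaultdict(int)
while events:
    event = events.popleft()
    counts_by_user[event["user"]] += 1

# A "real-time report" is now just a read of the live aggregate.
print(dict(counts_by_user))
```

Because there is only one stream and one transformation path, there is no second pipeline whose logic can drift out of sync with the first, which is exactly the dual-pipeline problem described above.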