A few days ago, I had to perform an aggregation on a streaming DataFrame. The moment I applied groupBy for the aggregation, the data got shuffled. That raises the question: how do I maintain order?
Yes, I can use orderBy on a streaming DataFrame in Spark Structured Streaming, but only after an aggregation and only in complete output mode. There is no way to order streaming data in append mode or update mode.
I tried different ways to solve this. If I stay with Spark Structured Streaming, I can sort the streamed data within each batch, but not across batches.
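To make the limitation concrete, here is a small plain-Python simulation. No Spark is involved, and the batch contents and the `sort_each_batch` and `flatten` helpers are made up for illustration. Each micro-batch is sorted on its own, yet the combined output is still out of order because one record arrives in a later batch than it "should".

```python
# Hypothetical micro-batches of (event_time, value) pairs.
# The record with event_time 2 arrives late, in the second batch.
batches = [
    [(3, "c"), (1, "a")],
    [(2, "b"), (4, "d")],
]

def sort_each_batch(batches):
    """Sort every micro-batch independently, as a per-batch sort would."""
    return [sorted(batch) for batch in batches]

def flatten(batches):
    """Concatenate batches in arrival order, the way a sink would see them."""
    return [record for batch in batches for record in batch]

output = flatten(sort_each_batch(batches))
event_times = [t for t, _ in output]

print(event_times)                         # [1, 3, 2, 4]
print(event_times == sorted(event_times))  # False
```

Sorting inside a batch only fixes the order among records that happened to arrive together; it cannot move a record into a batch that has already been written.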
I started looking for solutions in other technologies such as Apache Flink and Apache Storm. What I found in the end was disappointment. 😦
A bit of light at the end of the tunnel
Luckily, there is Apache Kafka Streams, which provides access to its StateStore. Kafka Streams also provides the Processor API.
The low-level Processor…
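The excerpt stops here, so what follows is only an assumption about how a state store could help with ordering, not the original post's code and not the Kafka Streams API. It is a plain-Python sketch: the hypothetical `OrderingBuffer` class holds incoming records in a buffer (standing in for the state store) and releases them in event-time order when a periodic flush is triggered.

```python
import heapq

class OrderingBuffer:
    """Toy stand-in for a processor backed by a state store (not the Kafka Streams API)."""

    def __init__(self):
        self._store = []  # min-heap keyed by event time; plays the role of the state store
        self._seq = 0     # tie-breaker so equal timestamps keep arrival order

    def process(self, event_time, value):
        """Buffer an incoming record instead of forwarding it immediately."""
        heapq.heappush(self._store, (event_time, self._seq, value))
        self._seq += 1

    def punctuate(self, up_to):
        """Emit, in event-time order, every buffered record with event_time <= up_to."""
        emitted = []
        while self._store and self._store[0][0] <= up_to:
            event_time, _, value = heapq.heappop(self._store)
            emitted.append((event_time, value))
        return emitted

ordering = OrderingBuffer()
for record in [(3, "c"), (1, "a")]:
    ordering.process(*record)
first = ordering.punctuate(up_to=1)   # only the record with event_time 1 is released

for record in [(2, "b"), (4, "d")]:
    ordering.process(*record)
second = ordering.punctuate(up_to=4)  # the held-back record 3 is emitted after 2

print(first + second)
```

The price of this approach is latency: a record is held until the flush threshold passes it, and a record that arrives after its threshold has already been flushed would still come out late.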