DynamicCluStream: An Algorithm Based on CluStream to Improve Clustering Quality

Document Type : Original Article

Authors

Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Iran

Abstract

Data streams are continuous flows of data objects generated at high rates, and they must be processed in real time in a single pass. Clustering algorithms play a vital role in analyzing data streams by grouping similar data samples. Among the time-window models used for evolving streams, the sliding window moves gradually over the data and keeps only the most recent information, which improves clustering accuracy while reducing memory requirements. Distributed computing frameworks such as Apache Spark address the limitations of traditional tools in processing big data, including data streams. This paper presents DynamicCluStream, an enhancement of Spark-CluStream that uses a two-phase (online and offline) clustering approach and clusters recent data precisely. The algorithm determines the number of clusters dynamically by merging overlapping clusters during the offline phase, which significantly improves cluster precision. Experimental results show average precision improvements of up to 47 percent on the CoverType dataset and up to 92 percent on the PowerSupply dataset. Although removing data samples and merging clusters make the algorithm slower, this overhead is negligible in a distributed environment.
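The idea of merging overlapping clusters in the offline phase can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define when two clusters overlap, so the criterion used here (center distance smaller than the sum of the RMS radii) is an assumption, as are the names `MicroCluster`, `overlaps`, and `merge_overlapping`. The additive summary (count, linear sum, squared sum) follows the cluster feature vectors used by CluStream-style algorithms.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Additive summary of a cluster: count, per-dimension linear and squared sums."""
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "MicroCluster":
        dims = len(points[0])
        ls = [sum(p[d] for p in points) for d in range(dims)]
        ss = [sum(p[d] ** 2 for p in points) for d in range(dims)]
        return cls(len(points), ls, ss)

    def center(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        # RMS deviation of the summarized points from the center.
        var = sum(
            max(sq / self.n - (lin / self.n) ** 2, 0.0)
            for lin, sq in zip(self.ls, self.ss)
        )
        return math.sqrt(var)

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # Summaries are additive, so merging needs no access to raw points.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )


def overlaps(a: MicroCluster, b: MicroCluster) -> bool:
    # Assumed criterion (hypothetical): centers closer than the sum of the radii.
    return math.dist(a.center(), b.center()) < a.radius() + b.radius()


def merge_overlapping(clusters: Sequence[MicroCluster]) -> List[MicroCluster]:
    """Repeatedly merge any overlapping pair until no pair overlaps."""
    result = list(clusters)
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if overlaps(result[i], result[j]):
                    result[i] = result[i].merge(result[j])
                    del result[j]
                    changed = True
                    break
            if changed:
                break
    return result
```

Because the number of clusters that survive the loop depends only on the data, it does not have to be fixed in advance, which is the behavior the abstract describes for the offline phase.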

