MRH: A Large-Scale Text Dataset for Web Content Mining
Authors
The amount of information on the web grows every day and plays an increasingly important role in the discovery of diverse knowledge. In this paper, our goal is to create a new dataset for web content mining that can be used to test and evaluate new systems before the production phase. The key characteristics of the dataset are: semi-structured; size: 4.05 MB; type: text; number of rows: 298; number of columns: 6; file type: .csv (comma-separated values); domains: computer science, mathematics, physics, and chemistry. Python code is used to read a set of links from a set of websites, then read the web page content behind these links and save it as text. The dataset is discussed in terms of Dataset Overview and Scope, Data Quality and Robustness, and Utility and Applications. The evaluation shows that, with its robust structure comprising domain, website, and webpage data, it supports a variety of web content mining applications.
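The collection step described in the abstract (read links from a website, then save each page's content as text in a .csv file) might look roughly like the following sketch. It is not the authors' code: the function names, the use of the standard-library `html.parser`, and the four column names (`domain`, `website`, `url`, `text`) are assumptions made for illustration, and the dataset's actual six columns are not listed here.

```python
import csv
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects absolute hyperlinks and visible text from one HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # their content is not page text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_links_and_text(html, base_url):
    """Return (list of absolute links, visible text) for an HTML string."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


def fetch_html(url, timeout=10):
    """Download one page and decode it (hypothetical helper, needs network)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def write_rows(rows, fieldnames, out_file):
    """Write a list of dict rows to an open text file in CSV format."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

In use, one would call `fetch_html` on an index page, pass the result to `extract_links_and_text` to obtain the page links, repeat both calls for each link to obtain its text, and finally pass the collected rows to `write_rows` with a file opened via `open(path, "w", newline="", encoding="utf-8")`. Keeping the parsing separate from the download makes the extraction logic easy to check on a fixed HTML string without network access.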
Keywords:
web content mining, MRH Dataset, Text Mining Dataset, Data Collection
License
Copyright (c) 2025 Journal Port Science Research

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
How to Cite
- Published: 2025-08-03
- Issue: Vol. 8 No. 4 (2025): TRANSACTION ON ENGINEERING TECHNOLOGY AND THEIR APPLICATIONS
- Section: Articles


