MRH: A Large-Scale Text Dataset for Web Content Mining

Authors

  • Mohammed Ali Mohammed, College of Business Informatics, University of Information Technology and Communications (UOITC), Baghdad, Iraq.
  • Hasan Aqeel Abbood, College of Business Informatics, University of Information Technology and Communications (UOITC), Baghdad, Iraq.
  • Raad Mahmood Mohammed, College of Business Informatics, University of Information Technology and Communications (UOITC), Baghdad, Iraq.

The volume of information on the web grows day by day and plays an important role in the discovery of diverse knowledge. This paper presents MRH, a new dataset for web content mining intended for testing and evaluating new systems before they enter production. The dataset is semi-structured text stored as a single .csv (comma-separated values) file of 4.05 MB, with 298 rows and 6 columns, covering four domains: computer science, mathematics, physics, and chemistry. Python code is used to read a set of links from a set of websites, then to read the content of each linked web page and save it as text. The dataset is discussed under three headings (Dataset Overview and Scope, Data Quality and Robustness, Utility and Applications), and the evaluation shows that its robust structure, comprising domain, website, and webpage data, supports a variety of web content mining applications.
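The collection step described in the abstract (read a set of links, read each page, save its content as text in a CSV file) might look roughly like the following. This is a minimal sketch using only the Python standard library, not the authors' actual code. The six column names in FIELDS, the helper names (html_to_text, fetch_html, collect, write_csv), and the User-Agent string are all assumptions made for illustration; the abstract states only that the file has six columns and covers domain, website, and webpage data.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical column names: the abstract gives the column count (6), not the names.
FIELDS = ["id", "domain", "website", "url", "content", "n_chars"]


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_html(url, timeout=10):
    """Download a page and decode it with the declared charset (UTF-8 fallback)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def collect(links, fetch=fetch_html):
    """Turn (domain, website, url) triples into one row dict per page."""
    rows = []
    for i, (domain, website, url) in enumerate(links, start=1):
        text = html_to_text(fetch(url))
        rows.append({"id": i, "domain": domain, "website": website,
                     "url": url, "content": text, "n_chars": len(text)})
    return rows


def write_csv(rows, fileobj):
    """Write the rows with a header line to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

To write an actual file, one would open it with `open("mrh.csv", "w", newline="", encoding="utf-8")` (the file name is hypothetical) and pass the handle to `write_csv`. The `fetch` parameter makes it possible to substitute a stub for the network call when testing the pipeline offline.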

Keywords:

web content mining, MRH Dataset, Text Mining Dataset, Data Collection

MRH: A Large-Scale Text Dataset for Web Content Mining. (2025). Journal Port Science Research, 8(4), 321-326. https://doi.org/10.36371/port.2025.4.2
