Improving the credibility of the extracted position from a vast collection of job offers with machine learning ensemble methods

Paweł Drozda; Krzysztof Ropiak; Bartosz Nowak; Arkadiusz Talun; Maciej Osowski

doi:10.31648/ts.9319

Paweł Drozda

University of Warmia and Mazury, Olsztyn

Krzysztof Ropiak

University of Warmia and Mazury, Olsztyn

Bartosz A. Nowak

University of Warmia and Mazury, Olsztyn

Arkadiusz Talun

Emplocity S.A.

Maciej Osowski

Emplocity S.A.

Abstract

The main aim of this paper is to evaluate crawlers that collect job offers from websites. In particular, the research examines how effectively ensemble machine learning methods can verify the validity of the position extracted from a job ad. Moreover, to reduce the training time of the algorithms (Random Forests and XGBoost), granularity methods were tested as a way to shrink the input training dataset. Both methods achieved satisfactory results, with accuracy and F1 scores exceeding 96%. In addition, granulation reduced the input dataset by more than 99%, while the results were only slightly worse (accuracy lower by 1% to 5%, F1 lower by 3% to 8%). Thus, it can be concluded that the considered methods can be used in the evaluation of job web crawlers.
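
To illustrate how granulation can shrink a training set before an ensemble classifier is fitted, the following is a minimal sketch under stated assumptions, not the procedure used in the paper. The abstract does not specify the granulation variant, so the sketch assumes a simple concept-dependent scheme: objects of the same class that agree on at least a fraction `radius` of their attributes form a granule, and each granule is replaced by one representative built by per-attribute majority voting. The function name `granulate`, the `radius` parameter, and the toy data are all hypothetical.

```python
from collections import Counter


def granulate(rows, labels, radius):
    """Assumed concept-dependent granulation; returns a reduced (rows, labels)."""
    n_attrs = len(rows[0])
    covered = set()          # indices already absorbed into some granule
    new_rows, new_labels = [], []
    for i, (u, du) in enumerate(zip(rows, labels)):
        if i in covered:
            continue
        # Granule of u: same-class objects that agree with u on at least
        # `radius` of the attributes (u itself always qualifies).
        granule = [
            j for j, (v, dv) in enumerate(zip(rows, labels))
            if dv == du
            and sum(a == b for a, b in zip(u, v)) / n_attrs >= radius
        ]
        covered.update(granule)
        # Replace the granule by one object: majority value per attribute.
        representative = tuple(
            Counter(rows[j][k] for j in granule).most_common(1)[0][0]
            for k in range(n_attrs)
        )
        new_rows.append(representative)
        new_labels.append(du)
    return new_rows, new_labels


# Toy data (hypothetical): four binary features, label 1 = valid position.
rows = [
    (1, 0, 1, 1), (1, 0, 1, 0), (1, 0, 0, 1),
    (0, 1, 0, 0), (0, 1, 0, 1), (0, 1, 1, 0),
]
labels = [1, 1, 1, 0, 0, 0]

reduced_rows, reduced_labels = granulate(rows, labels, radius=0.75)
reduction = 1 - len(reduced_rows) / len(rows)
print(reduced_rows, reduced_labels, f"reduction: {reduction:.0%}")
```

The reduced rows and labels would then be passed to the classifier in place of the full training set. A lower `radius` merges more objects into each granule and shrinks the data further, at the cost of discarding more detail.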

Keywords: machine learning, web scraping, granularity methods, classification

Published: 2023-09-19

Drozda, P., Ropiak, K., Nowak, B., Talun, A., & Osowski, M. (2023). Improving the credibility of the extracted position from a vast collection of job offers with machine learning ensemble methods. Technical Sciences, 26(26), 125–140. https://doi.org/10.31648/ts.9319

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Wydawnictwo Uniwersytetu Warmińsko-Mazurskiego w Olsztynie