Intelligent Information Retrieval from Unstructured Data using Natural Language Processing


  • Muhammad Yusuf Khan, Usman Institute of Technology, Karachi, Pakistan
  • Syed Zain Ali, Usman Institute of Technology, Karachi, Pakistan
  • Muhammad Hassan Sohail, Usman Institute of Technology, Karachi, Pakistan
  • Muhammad Wasim, Usman Institute of Technology, Karachi, Pakistan
  • Lubaid Ahmed, Usman Institute of Technology, Karachi, Pakistan


Information Extraction; Natural Language Processing; Filtering Unstructured Curriculum Vitae; Named Entity Recognition


Companies and recruitment agencies must go through large numbers of curricula vitae every day to find suitable candidates, which is inefficient when done manually by a recruiter. In this paper, an automatic system is proposed for selecting the best candidate. The proposed model extracts the vital information from unstructured curricula vitae and transforms it into a structured format. It also allows recruiters to filter and search for only the relevant data within the structured curricula vitae. The model uses data extraction, natural language processing, and named entity recognition techniques to convert unstructured information into structured information.
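
The abstract does not give implementation details, so the following is only a minimal sketch of the unstructured-to-structured idea, not the authors' system. It assumes regular expressions and a small hypothetical skill vocabulary (`SKILLS`) in place of a trained named entity recognition model, and it assumes the first non-empty line of a curriculum vitae is the candidate's name. The function names `parse_cv` and `filter_by_skill` and the sample records are illustrative only.

```python
import json
import re

# Hypothetical skill vocabulary; a real system would rely on a trained NER model.
SKILLS = {"python", "java", "sql", "machine learning"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d -]{7,}\d")


def parse_cv(text):
    """Turn raw CV text into a flat, JSON-serializable record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    lowered = text.lower()
    return {
        # Assumption: the first non-empty line is the candidate's name.
        "name": lines[0] if lines else None,
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        # Whole-word match against the hypothetical vocabulary.
        "skills": sorted(
            s for s in SKILLS
            if re.search(r"\b" + re.escape(s) + r"\b", lowered)
        ),
    }


def filter_by_skill(records, skill):
    """Return only the records that list the requested skill."""
    return [r for r in records if skill.lower() in r["skills"]]


# Made-up sample input for illustration.
sample_cvs = [
    "Jane Doe\njane.doe@example.com\n+92 300 1234567\nSkills: Python, SQL",
    "John Roe\njohn.roe@example.com\nSkills: Java",
]
records = [parse_cv(cv) for cv in sample_cvs]
print(json.dumps(filter_by_skill(records, "python"), indent=2))
```

Each record is a plain dictionary, so it serializes directly to JSON, and a recruiter-style query reduces to filtering on one of its fields.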

