Content validity evidence for a personality inventory
LLMs-assisted psychometrics
DOI:
https://doi.org/10.15448/1980-8623.2025.1.47225
Keywords:
artificial intelligence, psychological assessment, psychometrics
Abstract
Large Language Models (LLMs) represent a significant advancement in Natural Language Processing (NLP). This study investigates the use of these models in gathering content-based validity evidence for a new instrument assessing the Big Five personality factors. Items for the new instrument were created by ChatGPT and semantically analyzed by Gemini, alongside items from the BFI-2 (human-created). The analysis employed item classification via prompt (simulating an expert judge) and Exploratory Factor Analysis of item embeddings (obtained via API), proposing a novel approach to psychometrics. Results showed semantic convergence for neuroticism, agreeableness, openness, and conscientiousness, but greater dispersion for extraversion items. Semantic convergence was also observed between LLM-generated and human-created items (content-convergent validity). It is concluded that LLMs show significant potential to contribute to the process of gathering content-based validity evidence.
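The idea of factor-analyzing item embeddings can be illustrated with a minimal sketch. It is not the authors' pipeline: the embeddings below are synthetic stand-ins for vectors that would come from an embedding API, the embedding size, trait labels, and noise level are assumptions chosen for illustration, and the code stops at the item-by-item correlation matrix and its eigenvalues (a Kaiser-style dimensionality check that usually precedes an Exploratory Factor Analysis) rather than fitting or rotating a factor model.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DIMS = 256                 # hypothetical embedding size (assumption)
ITEMS_PER_TRAIT = 3          # hypothetical number of items per trait
TRAITS = ["neuroticism", "extraversion"]  # two traits, for brevity


def synthetic_item_embeddings():
    """Return (labels, matrix) with one row per item.

    Synthetic stand-in for API embeddings: each item is a trait
    centroid plus item-specific noise.
    """
    labels, rows = [], []
    for trait in TRAITS:
        centroid = rng.standard_normal(N_DIMS)  # shared "meaning" of the trait
        for _ in range(ITEMS_PER_TRAIT):
            labels.append(trait)
            rows.append(centroid + 0.5 * rng.standard_normal(N_DIMS))
    return labels, np.vstack(rows)


labels, emb = synthetic_item_embeddings()

# Items act as variables and embedding dimensions as observations,
# so np.corrcoef over the item rows yields an item-by-item matrix.
corr = np.corrcoef(emb)

# Compare average correlation within a trait and between traits.
same_trait = np.array([[a == b for b in labels] for a in labels])
off_diag = ~np.eye(len(labels), dtype=bool)
within = corr[same_trait & off_diag].mean()
between = corr[~same_trait].mean()

# Eigenvalues above 1 give a rough count of dominant dimensions.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]
n_factors = int((eigenvalues > 1.0).sum())

print(f"mean within-trait r:  {within:.2f}")
print(f"mean between-trait r: {between:.2f}")
print(f"eigenvalues > 1:      {n_factors}")
```

In this toy setting, semantic convergence appears as items of the same trait correlating more strongly with each other than with items of the other trait. With real embeddings the same correlation matrix could be passed to a dedicated EFA routine with rotation.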
License
Copyright (c) 2025 José Maurício Haas Bueno, Ricardo Primi, Emanuel Duarte de Almeida Cordeiro, Ana Deyvis Santos Araújo Jesuíno, Monalisa Muniz, Ana Paula Porto Noronha

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.




