Revisiting cognitive load theory: second thoughts and unaddressed questions

Received: 23 Jan. 2020. Approved: 09 Feb. 2020. Published: 15 Jul. 2020.

Abstract: In cognitive load theory (CLT), learning is the development of cognitive schemas in long-term memory, which has no known limits, and can happen only if our limited working memory can process the new information presented and the amount of information that does not contribute to learning is low. According to this theory, (1) learning is optimal when instructional support is decreased going from worked examples via completion problems to autonomous problem solving, and (2) learners do not benefit from practicing retrieval with complex content. However, studies on productive failure and retrieval practice have provided clear evidence against these two guidelines. This article discusses issues with CLT, and with research inspired by this theory, that remain largely ignored among cognitive load theorists but have likely contributed to these contradictory findings. It concludes that these issues should make us question the usefulness of CLT in health science education, medical education, and other complex domains, and presents recommendations for both educational practice and future research on the matter.

In cognitive load theory (CLT), learning is the development of cognitive schemas in long-term memory, which has no known limits, and can happen only if (i) the information to be processed is within the narrow limits of our working memory and (ii) the amount of information that does not contribute to learning is minimised (e.g., [1][2][3][4]).
This theory has resulted in a series of guidelines for the design of instruction in the context of learning complex content or procedures (e.g., [5][6][7]), including that (1) learning is optimal when instructional support is gradually decreased, going from worked examples via completion problems to autonomous problem solving, and (2) learners do not benefit from practicing retrieval with complex content.

Key lessons (1): Retrieval practice (RP)
Research on RP has consistently demonstrated that taking a memory test (i.e., RP) not only assesses what we know but also enhances retention, an effect also referred to as the testing effect (e.g., [8][9][10]). Cognitive load theorists have stated that there is no testing effect (i.e., no benefit of RP) for complex content (e.g., [11]). However, Karpicke and Aue [8] indicated that a key finding from RP research has been that the testing effect is alive and well for complex content, and their response to Van Gog and Sweller [11] neatly summarises some of the key flaws of research inspired by CLT.
To start, a core assumption in CLT is that information to be processed imposes a load on working memory (cognitive load), and that this load depends on how many new elements of information (i.e., elements not yet stored in cognitive schemas that can be retrieved from long-term memory) must be processed, as well as on how these elements are interrelated (element interactivity). If the total number of new elements to be processed plus their interactions exceeds the narrow limits of working memory, cognitive overload occurs. The problem with the concept of element interactivity is that it is not defined in any measurable way, and consequently, neither are cognitive load and overload. Moreover, although element interactivity is recognised as a key factor in cognitive load, it is rarely clear how element interactivity is manipulated in experiments inspired by CLT, and in many experiments (e.g., memorising isolated words or single sentences) it may not be an important factor after all.

Finally, a common pitfall in research inspired by CLT is that small sample sizes leave researchers (very) unlikely to detect differences of a practically relevant magnitude (e.g., half a standard deviation), and yet researchers erroneously interpret statistically non-significant outcomes as evidence in favour of "no difference". Using Bayesian methods, which, contrary to null hypothesis significance testing, can help researchers establish evidence in favour of one hypothesis relative to one or several other hypotheses, Karpicke and Aue [8] indicated that their small-scale meta-analysis provides substantial evidence in favour of a small positive testing effect relative to the null hypothesis of no testing effect.
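The underpowered-study problem described above is easy to make concrete with a simulation. The sketch below (illustrative assumptions, not data from any study discussed here) asks: with 20 participants per group and a true effect of half a standard deviation (d = 0.5), how often does a two-sample t-test reach p < .05?

```python
# Sketch of the small-sample pitfall: simulate many two-group experiments
# with a true effect of d = 0.5 and count how often p < .05.
# Group size, effect size, and alpha are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, n_sims = 20, 0.5, 10_000

significant = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)  # true effect: 0.5 SD
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        significant += 1

power = significant / n_sims
print(f"Estimated power: {power:.2f}")  # roughly one study in three detects the effect
```

Two thirds of such studies would end with a non-significant result despite a practically relevant true effect, which is exactly why "non-significant" must not be read as "no difference".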

Key lessons (2): Productive failure (PF)
A second key statement from CLT is that learning is optimal when instructional support is gradually decreased, going from worked examples via completion problems to autonomous problem solving, and there are studies that appear to provide some evidence in favour of that prediction (e.g., [12][13]). However, participants in these studies worked individually and focussed on learning rules that not everyone would agree are "complex", such as learning how to apply basic rules from probability calculus to calculate a conditional probability. Moreover, studies inspired by PF have provided evidence for the notion that, at least under some conditions, initial struggle with complex content in the absence of high instructional support (i.e., worked examples, or very detailed instructions that make the problem easier) can benefit learning (e.g., [14][15][16][17][18]).
Many prominent cognitive load theorists have waved this finding away by arguing that these studies mainly focussed on "low element interactivity" material and that therefore CLT and PF could explain the findings equally well. However, the absence of measures of element interactivity makes this argument hard to sustain, and the materials reported in, for example, [15][16][17] are not any less complex (if anything, somewhat more complex) than the ones used in the studies that found evidence in favour of studying worked examples before solving problems autonomously (e.g., [12][13]).
A key factor that has remained largely ignored in research inspired by CLT is learning from peers, in dyads or small groups; experiments designed from a CLT perspective have almost exclusively focussed on participants learning individually, often in laboratory settings in which the participants did not really have any stake in the outcome (e.g., no course in biology, programming, or probability calculus coming up next). Yet, based on the literature on PF thus far, learning from peers appears to constitute a critical factor in PF. It is therefore surprising that most cognitive load theorists continue to dismiss the work on PF as focussing on "low element interactivity" content only, whatever that means given the lack of a clear definition and good measure of element interactivity, and that even a recent proposal to move from CLT to collaborative CLT [19] does not mention PF at all. This silence is the more remarkable given that the apparent contradiction between findings from PF research and predictions made by CLT has been discussed at several platforms before, including by prominent cognitive load theorists (e.g., [14]).

Why might initial struggle benefit learning? To start, a lack of prior knowledge may hinder learners from understanding complex problems: how they manifest, how they can be represented in a way that allows us to approach and try to solve them, and/or what methods can be used to solve them.
Moreover, when these problems are presented in an artificially (well-)structured manner, learners may not come to fully understand the nature of these problems, how they manifest, how they can be represented in a way that allows us to approach and try to solve them, and what methods we can use to solve them under what conditions. PF aims to circumvent these issues by having students generate and explore the potential and limitations of different representations of a type of problem, say Type X, and of methods to solve Type X (Phase 1), to then provide them with opportunities to establish useful rules for representing and solving Type X (Phase 2).
When we design learning and practice tasks around Type X at an appropriate level of complexity, in a context that is challenging (though not frustrating), Phase 1 can help learners to activate and apply prior knowledge of concepts that are important to understand Type X, to draw attention to critical characteristics of these concepts and of Type X, and to explain and elaborate on these characteristics. In addition, both Phase 1 and Phase 2 can create a safe space for students to explore, generate, make mistakes, and learn and practice methods to approach and solve Type X.

Knowledge as static vs. as dynamic
The key notion in CLT that learning is the development of cognitive schemas in long-term memory rests, at least implicitly, on the assumption that content to be learned is something static that can be captured in schemas and then retrieved from long-term memory. However, high-stakes settings like the ones in the previous paragraph have in common that the nature of knowledge, tasks, and problems is dynamic and ever-evolving. With the advancement of science and technology, many things learned at one point turn out to be less useful than expected or lose their usefulness because the nature of problems, roles, and responsibilities has changed.
Apart from these high-stakes settings, take learning and maintaining a foreign language as an example. From personal experience, most of us can tell that grammar structures and proverbs in a foreign language, once learned, become rusty and may be retrieved with errors (i.e., incorrect memories) if we do not (continue to) use that foreign language regularly. Using that language regularly, with native or otherwise fluent speakers, provides a natural form of RP. Moreover, language evolves: new words and proverbs are born, the use of grammar structures may change over time as well, and the RP of using the language with others can help us adapt to these changes. In this respect, knowledge is not necessarily exclusively about something "out there" for us to learn but is, at least to some extent, also co-created in dialogue and conversation. Finally, we do not need to see a worked example or completion problem for every new grammar structure or proverb; in line with PF, much of it is learned while "struggling" in conversation with others.

Definitions and poor methodological practice
As mentioned earlier, the concepts of element interactivity, cognitive load, and cognitive overload, key concepts in CLT, are poorly defined and good measures are lacking. In fact, the dominant measurement practice since 1992 has been to have participants self-report on a nine-point scale how much mental effort they invested in a task they just completed [21][22], depending on the study either once at the end of a learning and/or post-test stage or several times (i.e., repeatedly) during such a stage, for instance after each of a series of tasks. This practice has persisted despite repeated critiques, including the perfect confounding of measurement error, differences between the tasks in which it is used, and a likely shift in participants' responses from one task to the next [23]. A robust rule from psychometrics is that single self-report items can be incredibly noisy (i.e., carry large measurement error) and are usually much noisier than measurements obtained from a series of items on the same variable of interest. Task differences may make it difficult to compare ratings across tasks, not least because our willingness to invest mental effort in a given task may well depend on how many tasks we have seen before and how much effort we invested in each of them. Finally, response shift is a real issue because our conceptions of task complexity, as well as our self-assessments of what we are capable of, may change as we learn.
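The psychometric rule about single-item noise can be illustrated with a short simulation (all numbers are hypothetical, chosen only to show the mechanism): we generate a latent "invested effort" score per participant, add independent rating noise per item, and compare how well one item versus the mean of six items recovers the latent score.

```python
# Sketch of why a single self-report item is noisier than a multi-item scale.
# Latent scores, noise level, and item count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_participants, n_items, noise_sd = 500, 6, 1.0

true_effort = rng.normal(0.0, 1.0, n_participants)
# Each "item" = latent score + independent measurement error.
items = true_effort[:, None] + rng.normal(0.0, noise_sd, (n_participants, n_items))

r_single = np.corrcoef(true_effort, items[:, 0])[0, 1]
r_mean = np.corrcoef(true_effort, items.mean(axis=1))[0, 1]
print(f"single item r = {r_single:.2f}, six-item mean r = {r_mean:.2f}")
```

Averaging over items cancels out part of the independent error, which is the intuition behind the Spearman-Brown prediction that longer scales are more reliable; a single mental effort rating enjoys none of this error cancellation.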
Newcomers to a complex topic are often poor self-assessors in that topic (e.g., [24]); self-assessment is a skill that improves with practice. If, when seeing a counterintuitive probability problem for the first time, we think it is easy and therefore invest little mental effort, and we then learn about the solution and the steps towards it and realise it is more difficult than anticipated, we may invest more mental effort in a second problem of the same type. This is not because the second problem is more complex, but because we now have a better appreciation of some initially "hidden" complexities or difficulties and have become more aware of the limitations of our probability problem-solving skills.
To account for a range of empirical findings that could not be explained in terms of a general "cognitive load" or invested mental effort alone, cognitive load theorists introduced different types of cognitive load, some of which are considered not effective for learning (i.e., "bad" load) and some of which potentially stimulate learning (i.e., "good" load). It is beyond the scope of this article to provide a detailed review of these different types of load and of how different scholars have attempted to define and measure them; this work has been done already (e.g., [1-5, 20, 23]) and can be briefly summarised as follows. On the one hand, there are cognitive load theorists who state that we need three types of cognitive load: load arising from essential aspects of the task (intrinsic), load due to non-essential aspects of the task (extraneous), and load arising from deliberate engagement in learning (germane) (e.g., [7,25]). On the other hand, there are scholars who state that germane load is that part of intrinsic load that results in learning (i.e., not all intrinsic load results in learning); from this perspective, germane load is not a third independent type of load but part of intrinsic load (e.g., [2,5,20,26]). Along with this lack of consensus on definitions, we have seen the development and use of a variety of self-report questionnaires (e.g., [12,[27][28][29][30]) which all attempt to measure two or three types of load, but with somewhat different wording. Each of these questionnaires suffers from question wording effects, suffers from the same task-difference and response-shift issues as the mental effort self-report item, and all beg the same question: if we cannot even properly define these types of load, what do the resulting scores measure? On top of these measurement problems, poor statistical practice persists: interpretations of statistically non-significant outcomes such as "there is no effect" are all over the place in CLT research.
Lehmann and Seufert asked learners to indicate their preference for either auditive or visual texts, and they found that, among learners with a preference for visual texts, the ones given visual texts on average learned more than their peers who were given auditive texts. However, as the authors themselves recognise, most texts in everyday life are presented visually, so increased closeness to real life may be a much more likely explanation for this finding than tailoring materials to learners' preferred learning styles. Furthermore, there is another potentially obvious confounder: reading skills. What if the participants who indicated a preference for visual texts happened to be the ones with better reading skills than the ones who indicated a preference for auditive texts? When we have to process information, competence and preference often go together, and if that is the case here, the participants with better reading skills may more frequently have indicated a visual preference than those with somewhat poorer reading skills. The finding that presenting visual texts on average resulted in better outcomes in the "visual preference" group than in the "auditive/ambiguous preference" group may then largely, if not exclusively, reflect a difference in reading skills rather than a difference in (whether or not tailoring to) style per se.
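The confounding scenario sketched above can be simulated to show how strong the spurious "tailoring works" signal can get. In the sketch below (all parameters are hypothetical, not estimates from the study), reading skill drives both the reported visual preference and the outcome with visual texts, while preference itself has no causal effect at all.

```python
# Sketch of confounding by reading skill (hypothetical parameters):
# preference has no causal effect on the outcome, yet the "visual
# preference" group outscores the rest simply because skilled readers
# are more likely to report that preference.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
reading_skill = rng.normal(0.0, 1.0, n)
# More skilled readers are more likely to report a visual preference.
p_visual = 1.0 / (1.0 + np.exp(-2.0 * reading_skill))
prefers_visual = rng.random(n) < p_visual
# Outcome with visual texts depends on reading skill only, not on preference.
outcome_visual = reading_skill + rng.normal(0.0, 1.0, n)

gap = outcome_visual[prefers_visual].mean() - outcome_visual[~prefers_visual].mean()
print(f"'visual preference' group outscores the others by {gap:.2f}")
```

A sizeable group difference emerges with zero causal contribution of preference, which is why such designs need a measure of reading skill (or random assignment of materials crossed with measured skill) before any "matching styles" interpretation.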

To conclude
Two key statements from CLT are that (1) learning is optimal when instructional support is decreased going from worked examples via completion problems to autonomous problem solving and (2) learners do not benefit from practicing retrieval with complex content. However, research inspired by PF has provided evidence against (1), while research on RP has provided evidence against (2). An immediate recommendation for teachers and others involved in educational practice is therefore to not consider CLT, or any educational theory for that matter, as the holy grail that provides the answer to every instructional question.

That said, the suggestion that CLT is a useless theory that can now be placed in the museum of dead theories is neither the message nor the intention of this article. However, to assess the continued relevance of CLT as a key contributor to educational research and practice, more cognitive load theorists should take note of critical arguments that have been made for quite a while now: findings from research on PF and RP that contradict core predictions from CLT; critiques of the lack of definitions and good measures and of the lack of consensus on these questions within the cognitive load community; and recommendations for good methodological and statistical practice, such as striving for larger samples and refraining from interpreting statistically non-significant findings as evidence of "no difference". If we take these points together, we may in the coming years learn much more about the conditions under which CLT, PF, and RP can each inform educational practice.