XRDS: Crossroads, The ACM Magazine for Students
On the Promising Path of Making Education Effective for Every Student

By Allen Nie


Tags: Geographic characteristics, K-12 education, Machine learning algorithms, Reinforcement learning


The San Francisco Unified School District (SFUSD) had a harsh awakening dealing with the fallout from its attempt to adopt the 2021 California Mathematics Framework. Although the framework is a thoroughly compiled document grounded in rigorous academic research, its most controversial element was detracking: removing the honors math class in middle school for students who had progressed ahead [1, 2]. The decision drew ire from parents and was one of many issues behind a recall election that removed three members of the school board in 2022.

Education used to be a privilege. Apprenticeship and tutoring were the predominant means of learning. This form of teaching remains active today; a Ph.D. program, for example, is itself an apprenticeship in which you learn by working directly with a senior mentor. Modern classroom-based education is an effort to make education accessible to the masses. Its origins can be traced back to 10th-century China and 16th-century Japan, with its modern form emerging in 18th-century Europe. Tracking, which separates students at different stages of learning progress into distinct "tracks," remains a crude attempt to personalize classroom-style mass education, and it often raises concerns about its effectiveness for individual students and whether it is fair and equitable for all.

Judging by the reactions of the irritated parents in the San Francisco school district, personalization in education is viewed as a positive factor in a child's school experience. At its core, personalizing a student's learning experience is about making decisions. At the highest level, the State Board of Education makes decisions by proposing the curriculum framework, while at the lowest level, a teacher decides how to deliver the materials to students each day.

Reinforcement learning (RL), a branch of artificial intelligence (AI), studies the problem of how to make optimal decisions. We can define an objective for the AI to maximize (via a reward function), and through a series of optimization rounds, an "optimal" solution can be discovered.

This raises an interesting question: What objective should we optimize regarding student learning? Although various areas of education research measure students' learning experiences in terms of enjoyment, engagement, creativity, and fulfillment, it is undeniable that the primary objective of learning is the accumulation of knowledge, which can be assessed through exams and testing. Most of my research focuses on determining the optimal sequence of decisions for each student to increase learning efficiency. Using RL algorithms and formulating the act of teaching as a decision-making problem, we can find optimal solutions to some of the interesting, yet basic, questions about a student's learning experience.

Personalized Tutoring through Reinforcement Learning

Learning science principles are often too broad and general to tell practitioners what is needed to design engaging, effective educational experiences. RL can be especially advantageous in out-of-classroom learning environments, where motivation and engagement are crucial, or in non-traditional curricula that adopt alternative instructional methods rather than traditional lectures and practice, precisely the contexts in which the learning sciences provide limited guidance on effectively supporting students.

Together with Sherry Ruan, Dr. Emma Brunskill, Dr. James Landay, and a team at Stanford, I created narrative-based, pedagogically supported educational software for math concept learning for students roughly ages 9–12. It uses RL to adaptively learn how to provide optimal responses to support student learning [3]. With this software, we can employ AI algorithms to determine how to best support a student's learning journey. RL algorithms interact with students and learn to choose an intervention (e.g., a hint or a helpful Socratic question), given the current context (e.g., an estimate of a student's knowledge), to maximize the desired outcomes such as improved test scores.
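As a simplified illustration of this choose-an-intervention-given-context loop, here is a minimal linear contextual bandit sketch. It is not the study's actual system (which used a full RL method), and the feature names and reward are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# the four teaching strategies described in the study
ACTIONS = ["direct_hint", "encouragement", "guided_prompt", "acknowledgment"]

def choose_action(theta, context, epsilon=0.1):
    """Epsilon-greedy: usually pick the action whose linear value
    estimate is highest for this student context; sometimes explore."""
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(theta @ context))

def update(theta, context, action, reward, lr=0.05):
    """Move the chosen action's value estimate toward the observed reward."""
    pred = theta[action] @ context
    theta[action] += lr * (reward - pred) * context
    return theta

# hypothetical context: [estimated knowledge, math anxiety, bias term]
theta = np.zeros((len(ACTIONS), 3))
for _ in range(1000):
    ctx = np.array([rng.random(), rng.random(), 1.0])
    a = choose_action(theta, ctx)
    # stand-in reward: pretend direct hints help anxious students most
    reward = float(a == 0) * ctx[1] + rng.normal(scale=0.1)
    theta = update(theta, ctx, a, reward)
```

After enough interactions, the policy's value estimates start to reflect which intervention pays off for which kind of student, which is the essence of the differentiation discussed below.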

In this study, we utilized an informal online learning platform to educate students on the mathematical concept of volume. The learning activities are integrated into a narrative storyline. Based on student responses, an AI tutor selects from four common teaching strategies: offering direct hints, giving general encouragement, providing guided prompts to scaffold learning (e.g., "Have you heard of a unit cube?"), or giving passive positive acknowledgment (e.g., a smiley face). Figure 1 presents a screenshot of the web interface in use. In total, we recruited 269 elementary school students to use the RL-narrative web interface (RL condition). Additionally, 70 students used an alternative interface with no narrative or RL-based personalization (control condition). Students completed an eight-item assessment and a math anxiety survey before using the educational software, and completed another assessment afterward. We used proximal policy optimization [4] to update the RL policy continuously so it could learn and adapt to each student. We then computed the average difference between the pre- and post-test scores for students.
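The core of proximal policy optimization is its clipped surrogate objective, which keeps each policy update close to the policy that collected the data. A minimal, self-contained sketch of that objective (not the study's actual training code):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).
    ratio = pi_new(a|s) / pi_old(a|s); advantage estimates how much
    better the action was than average. Taking the elementwise minimum
    of the unclipped and clipped terms removes the incentive to push
    the probability ratio outside [1 - eps, 1 + eps]."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.minimum(unclipped, clipped).mean())
```

Maximizing this objective (e.g., by gradient ascent on the new policy's parameters) nudges the tutor toward interventions with positive advantage while limiting how far any single update can move the policy, which makes continuous, student-by-student adaptation stable.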

Encouragingly, students with a low initial pretest score (0–2) had a much larger exam score improvement in the RL narrative condition (Figure 2). The average improvement was 2.02 for these students (N=41), which means for a post-tutoring exam of eight questions, initially low-performing students, on average, got two more questions correct after the tutoring experience.

A key benefit of using RL to tutor students is its potential to differentiate among students and personalize instruction when doing so improves outcomes. Therefore, we should understand what differentiation, if any, the RL tutor performs. Our RL tutor has access to three categories of features: demographic features of the learner, features of the chat content, and features of the learner's interaction and performance during learning. We used a neural network interpretability method, integrated gradients, for this analysis. It showed the RL tutor differentiates students based on their pretest score and math anxiety level. Other features, such as how often the student pleaded for help or the sentiment of the chat messages, had little effect on how the RL tutor chose teaching strategies.
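Integrated gradients attributes a model's output to each input feature by accumulating gradients along a straight path from a baseline to the input. A minimal sketch, independent of our actual policy network (`grad_fn` is a stand-in for whatever computes the model's gradient):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients: average the gradient of the
    model output at points along the straight line from baseline to x
    (midpoint rule), then scale by (x - baseline). grad_fn(z) must
    return d(output)/dz evaluated at z."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# sanity check on a linear model f(x) = w . x, whose attribution
# should be exactly (x - baseline) * w
w = np.array([1.0, 2.0, -3.0])
x = np.array([0.5, 0.2, 0.1])
attr = integrated_gradients(lambda z: w, x, np.zeros(3))
```

A useful property visible here is completeness: the attributions sum to the difference between the model's output at the input and at the baseline, so large attributions for pretest score and anxiety mean those features genuinely drive the policy's choices.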

These two features define four potential student groups, and we found two of them most illuminating. One group consisted of students who scored very high on the pre-tutoring test but were very anxious about math. Another group scored very low on the pre-tutoring test yet were not anxious about math. These two groups seem paradoxical. How can you score very high on an exam but still be nervous about your math ability? Or how can you get a low score and be so nonchalant about math? We did not conduct further analysis, but a student's self-confidence in a STEM subject can be greatly influenced by their gender, ethnicity, and cultural background. A student's learning experience is defined by the intersection of their unique background and identities.

An RL tutor might not have learned the intersectionality framework, but it knows that to help each student learn, it needs to treat them differently. For the first group of students, as shown in Table 1, the RL tutor gives direct hints as quickly as possible; perhaps this eases their anxiety and boosts their confidence for the final exam. For the second group, however, the RL tutor avoids giving direct answers, letting them engage with the exercises and invest more effort. The interactions observed among the features describing the student, the context, and the pedagogical choices can provide valuable insights for expert analysis and help generate future hypotheses in the learning sciences.

Automated Tutoring on a Global Scale

We have demonstrated that personalized teaching is highly effective, particularly for initially low-performing students. However, we face a scaling challenge, not in terms of increasing the number of students (commonly known as "vertical scaling"), but in extending across various subjects and domains, which is referred to as "horizontal scaling." The RL-based optimal solution we developed is only optimal for this specific website interface and a limited set of problems. Is there any hope for a generalist algorithm that can potentially teach across different student populations and subject domains?

While 2022 was packed with significant events, one momentous highlight was the world's realization of the extraordinary capabilities of large language models (LLMs). For the first time in history, we have an AI algorithm that appears to be a generalist—a jack of all trades, master of none (as of right now). Can we use LLMs to provide personalized tutoring for students? We wanted to measure the teaching performance of such a solution.

In 2023, with Dr. Brunskill, Dr. Chris Piech, and a team of researchers at Stanford, I designed an experiment supported by a generous unrestricted research grant from OpenAI. We provided 5,831 students from 146 countries in a large online coding course with access to GPT-4 [5]. Coding education is an area where one might expect LLM support to be particularly beneficial, for two reasons. First, although LLMs demonstrate proficiency across a wide range of tasks, they are known to be especially strong at writing computer programs. Second, programmers are likely to be expected to use LLM-based coding tools, such as GitHub Copilot or JetBrains AI, in a software engineering job.

A visual representation of important dates is provided in Figure 3. The course started on April 24. Students were randomly assigned to two groups, treatment and control, on May 8. An email was sent to students in the treatment group, letting them know they now had access to a free, customized GPT-4 interface within the class. Our interface included explicit warnings about potential hallucinations from GPT-4 and reiterated the course policies: conversations were monitored, and students were forbidden from using GPT-4 to solve homework problems directly. Students in the control group did not get the email. An optional diagnostic exam was administered between May 24 and 26. In contrast to concerns that students might overuse LLMs, only a small fraction (14.2%) of the students who were given access to our course GPT interface, and emailed about the opportunity, chose to use it.

As part of the course, all students were offered an opportunity to take an optional diagnostic exam during a three-day period. We found that giving students access to GPT-4 and advertising it led to a substantial, statistically significant decrease in exam participation: 4.3 percentage points. Access and advertising also decreased homework participation and section attendance. This drop in engagement was concerning. However, the trend was not uniform across countries, as Figure 4 shows. Exam participation increased by 14.8 percentage points among students from countries with low Human Development Index (HDI) scores. HDI, computed and published by the United Nations, summarizes the average health, knowledge (schooling and education), and standard of living of a country's people, and is often used to describe a country's level of development (less versus more developed countries).

There were 325 students from 16 low-HDI countries in Asia (Afghanistan, Pakistan, and Yemen), North America (Haiti), and Africa (Benin, Ethiopia, Gambia, Madagascar, Mali, Mozambique, Nigeria, Rwanda, Senegal, Sudan, Tanzania, and Uganda). These results are especially encouraging, as there is an ongoing effort in the global education community to use GPT-4 to boost educational resources in areas underserved due to unrest, war, poverty, or extreme events.

The low adoption rate overall and the fact that students could freely choose to take the diagnostic exam introduced complexity in understanding the effect of using LLMs for learning. In Figure 5, we show that in the experiment group, there were students who used GPT-4 (marked as red circles) and students who didn't (marked as blue crosses). This complicated our analysis because a simple difference between the two groups no longer reveals the true impact of using GPT-4 on learning.

We used a causal inference estimator from the seminal work of Guido Imbens, a winner of the 2021 Nobel Prize in economics, and Joshua Angrist [6], which argues that the effect for students who chose to use the tool can be estimated if certain conditions are met. This estimator helped us understand whether there could be a positive effect on learning outcomes, as evaluated by exam scores. Our estimate suggests that, by providing GPT-4 to students, adopters could see an average 6.8 percentage point increase in their exam scores over what they would have achieved without using GPT-4.
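Under one-sided noncompliance (the control group had no access, so no one in it could adopt the tool), the Imbens-Angrist local average treatment effect reduces to the intention-to-treat effect scaled up by the treatment group's adoption rate. A minimal sketch with made-up numbers, not our study's data:

```python
import numpy as np

def late_one_sided(y_treat, used_treat, y_control):
    """Wald-style LATE under one-sided noncompliance: the difference
    in mean outcomes between the randomized groups (the intention-to-
    treat effect) divided by the fraction of the treatment group that
    actually adopted the tool."""
    itt = np.mean(y_treat) - np.mean(y_control)
    adoption_rate = np.mean(used_treat)
    return itt / adoption_rate

# toy example: 2 of 10 treated students adopt, each gaining 10 points
y_control = np.full(10, 50.0)
y_treat = np.array([60.0, 60.0] + [50.0] * 8)
used = np.array([1, 1] + [0] * 8)
effect = late_one_sided(y_treat, used, y_control)  # 10.0
```

Note how a modest group-level difference (2 points here) implies a much larger effect for adopters once scaled by a 20% adoption rate; the same logic is why a 14.2% adoption rate is consistent with a sizable per-adopter benefit.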

The Mixed Effect of an AI Tutor: Disengagement for All But Benefits for the Adopters

The SFUSD board members were not evil, malicious villains who wanted to hurt the students in their district. Tracking in mathematics can exacerbate inequities by allocating pedagogical resources to some students while denying them to others. Tracking could potentially entrench and perpetuate the deep inequality that already exists in our society, particularly along gender and racial lines.

Our study of introducing GPT-4 to coding beginners showed a mixed result. The mere act of introducing AI as a tutor or a potential learning resource puts AI in a position of power relative to the students, implicitly telling them that the AI knows more than they do about the subject. Unlike other assistive educational technologies, AI seems able to cause real concerns among students. Ideally, the framing of AI should be inconsequential. However, our findings of decreased student engagement in homework, section attendance, and exam participation indicate a discouragement effect.

We also showed the potential benefit of AI on student learning (measured by exam score) for students who are willing to use the interface (adopters), which leads to a separate concern. The learning benefit enjoyed by adopters over non-adopters can widen the performance gap when students forgo a resource simply because they do not know how to use it, or because their background disadvantages them from using it effectively. This might exacerbate educational inequality, an important direction for further data analysis and research.

The solution is not to force every student to interact with the AI-based learning resource. We should create a smooth scaffolding for students to see the benefit of interacting with an AI tutor—creating an experience that shows students how to best take advantage of the tutor. Learning from the AI tutor is not just about getting a solution to a homework problem. An AI tutor can provide endless variations of practice problems, can utter different kinds of hints, can be interrogated ("Why do you solve it this way?"), and can work with the student to come up with a solution together. AI research in education must find a way to introduce AI-based technology to classrooms where all students benefit.

The Future of Effective Education for Every Student

As AI matures and enters the production-ready stage, using it to create software that benefits society overall should be a top priority. Among the many areas where AI is poised to drive societal-level transformation, AI-enhanced education could enable current and future generations to learn, achieve, and self-actualize. This urgency stems not only from the widespread underfunding of school districts and the high rates of teacher attrition, but also because the lack of education individualized for each student is no longer justifiable in light of advancements in AI. Creating software that can provide just-in-time, on-demand, personalized help tailored to every student's personality, learning style, and unique background is within reach. However, we must also recognize that when used in educational settings, these systems might cause harm. With more careful and pioneering research to come, the prospect of enhancing educational fairness and equity for every student is more promising than ever.

References

[1] Conrad, B. California's math misadventure is about to go national. The Atlantic. October 2, 2023; https://www.theatlantic.com/ideas/archive/2023/10/california-math-framework-algebra/675509.

[2] Loveless, T. San Francisco's detracking experiment. Education Next. March 29, 2022; https://www.educationnext.org/san-franciscos-detracking-experiment.

[3] Ruan, S. et al. Reinforcement learning tutor better supported lower performers in a math task. Machine Learning 113 (2024); https://doi.org/10.1007/s10994-023-06423-9.

[4] Schulman, J. et al. Proximal policy optimization algorithms. 2017. arXiv preprint arXiv:1707.06347.

[5] Nie, A. et al. The GPT surprise: Offering large language model chat in a massive coding class reduced engagement but increased adopters' exam performances. OSF Preprints qy8zd. Center for Open Science. 2024.

[6] Imbens, G. W. and Angrist, J. D. Identification and estimation of local average treatment effects. Econometrica 62, 2 (1994), 467–475.

Author

Allen Nie is a Ph.D. student in computer science at Stanford University, advised by Professors Emma Brunskill and Chris Piech. His research areas include offline reinforcement learning and causal inference with applications in education. He has interned at Microsoft Research and Google DeepMind. His Ph.D. is supported by a Yee-Hoffman grant from the Stanford Human-Centered AI Institute (HAI). He has published papers at NeurIPS, ICLR, AAAI, ACL, EMNLP, MLJ, and education conferences such as LAK and AIED.

Figures

Figure 1. Shown is the overall pipeline of the tutoring platform. A child learns to solve a series of mathematical problems. Whenever they struggle, they can choose to talk to a chatbot acting as a tutor. Before and after tutoring, the student takes an exam to measure their prior knowledge of the domain and to test how much more knowledge they gained after interacting with the tutorbot.

Figure 2. Post-exam score minus the pre-exam score, showing the student knowledge gained after the RL tutoring experience [3].

Figure 3. The timeline of the course and the randomized control trial experiment, extending from April 24 to June 5.

Figure 4. Student engagement changed by access to LLMs in each country.

Figure 5. Exam score difference.

Tables

Table 1. Numbers represent how, on average, the feature (with its original value) positively or negatively contributes to how our RL policy decides to increase or decrease the probability of choosing an action for the current student [3].


This work is licensed under a Creative Commons Attribution International 4.0 License.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2024 ACM, Inc.