New LLMs for Math Education
One of the goals of the Learning Engineering Virtual Institute is to use advances in AI to support middle school math learning.
With that aim in mind, our teams have been working on large language models (LLMs) in math learning contexts. We have trained an early version of LLMs for anyone to use and have made important contributions to the field, finding, for instance, that bigger is not always better.
To accelerate the pretraining, finetuning, and evaluation of LLMs with billions of parameters, we leverage UF’s supercomputer, HiPerGator, where we have access to 32 NVIDIA A100 GPUs (80 GB of high-bandwidth memory each). To orchestrate the GPU cluster, we adopted Microsoft’s DeepSpeed for distributed training to further maximize hardware acceleration. Building on ASSISTments’ data infrastructure, our contribution lies in two aspects: (1) researching, training, and evaluating pre-trained LLMs tailored to K-12 math learning and (2) exploring visual reasoning with math image responses using multimodal and multitask learning.
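As a rough illustration of the DeepSpeed setup mentioned above, the sketch below builds a DeepSpeed-style configuration dictionary for a 32-GPU job. The batch sizes, the ZeRO stage, and bf16 precision are assumptions chosen for illustration, not settings reported in this piece, and the helper name build_ds_config is hypothetical.

```python
import json


def build_ds_config(world_size: int, micro_batch_per_gpu: int, grad_accum_steps: int) -> dict:
    """Build a DeepSpeed-style config dict (illustrative values only)."""
    return {
        # DeepSpeed expects train_batch_size to equal
        # micro batch per GPU * gradient accumulation steps * number of GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # Assumed precision and ZeRO stage; tune these for the model at hand.
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 3},
    }


# 32 GPUs, matching the cluster size described above; other numbers are made up.
config = build_ds_config(world_size=32, micro_batch_per_gpu=4, grad_accum_steps=8)
print(json.dumps(config, indent=2))
```

A dictionary like this can be saved as JSON or handed to deepspeed.initialize through its config argument. ZeRO stage 3 shards parameters, gradients, and optimizer states across GPUs, which is one common way to fit multi-billion-parameter models into 80 GB devices.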
First, we have pre-trained models utilizing hundreds of millions of tokens derived from genuine K-12 math learning scenarios. These include Q&A sessions from Math Nation and problem-response pairs from ASSISTments. We build on three state-of-the-art LLMs: LLaMA 1 (7B), LLaMA 2 (7B), and GPT-J (6B).
Our validation process involves datasets from ASSISTments (for multiclass prediction), the NCTE Classroom Transcript Analysis (for multilabel prediction), and Math Nation’s online Q&As (for text generation). We compared our pre-trained models, their original counterparts, and GPT-3.5 (GPT-4 does not allow finetuning yet). For the comparison between our pre-trained models and their counterparts (e.g., GPT-J-Pretrained vs. GPT-J), the performance gain can reach 7% (LLaMA-1-Pretrained vs. LLaMA-1) if we consider the best performance of these models.
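To make the comparison concrete, here is a minimal sketch of how a multiclass accuracy, a multilabel micro-F1, and the gain between two models could be computed. The metric choices, the toy predictions, and the label names are assumptions for illustration; they are not the evaluation code or the data behind the numbers above.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class matches the gold class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over per-example label sets (multilabel prediction)."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # labels correctly predicted
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted but not gold
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # gold but not predicted
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy multiclass data (made up): gold classes and two models' predictions.
gold = [0, 1, 2, 3, 4, 2, 1, 0, 3, 4]
base_pred = [0, 1, 2, 0, 4, 1, 1, 0, 2, 4]
tuned_pred = [0, 1, 2, 3, 4, 1, 1, 0, 2, 4]

# Gain expressed in percentage points of accuracy.
gain_points = 100 * (accuracy(gold, tuned_pred) - accuracy(gold, base_pred))

# Toy multilabel data (made up): each example carries a set of labels.
gold_sets = [{"a", "b"}, {"c"}, {"a"}]
pred_sets = [{"a"}, {"c", "b"}, {"a"}]
f1 = micro_f1(gold_sets, pred_sets)

print(f"gain: {gain_points:.1f} points, micro-F1: {f1:.2f}")
```

Reporting the gain in percentage points, as in this sketch, avoids ambiguity between absolute and relative improvement.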
More importantly, we found that super-sized LLMs such as GPT-3.5 may not always be the best option for specific downstream tasks: our pre-trained models can outperform GPT-3.5 by 2-3% in accuracy, suggesting an opportunity to democratize SOTA AI in education without computing supremacy. We are wrapping up our validation experiment, and we will release the full report and models once it is completed. You can find our preliminary models on HuggingFace at the end of this piece.
Second, we have developed LLMs that incorporate students’ visual responses, educator feedback, and grading, using ASSISTments’ image and text dataset. These models are designed to bolster students’ mathematical reasoning by integrating visual reasoning and diagnostic feedback. Capable of processing multimodal inputs (both images and text), they offer dual (multitask) outputs: grading and feedback.
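One common way to train a model with two outputs is to combine a classification loss for the grade with a language-modeling loss for the feedback. The sketch below shows that idea in plain Python. It is an assumption about how such a multitask objective might look, not a description of the actual models; the function names, the toy probabilities, and the grading_weight hyperparameter are all hypothetical.

```python
import math


def grading_loss(class_probs, gold_label):
    """Cross-entropy for one example: negative log-probability of the gold score."""
    return -math.log(class_probs[gold_label])


def feedback_loss(token_probs):
    """Mean negative log-likelihood of the reference feedback tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def multitask_loss(class_probs, gold_label, token_probs, grading_weight=0.5):
    """Weighted sum of the grading and feedback losses."""
    return (grading_weight * grading_loss(class_probs, gold_label)
            + (1.0 - grading_weight) * feedback_loss(token_probs))


# Toy numbers (made up): probabilities over five score labels, gold score 3,
# and the model's probability for each token of the teacher's feedback.
probs = [0.05, 0.05, 0.10, 0.70, 0.10]
tokens = [0.9, 0.8, 0.6]
loss = multitask_loss(probs, 3, tokens)
print(round(loss, 4))
```

Raising grading_weight pushes training toward accurate scores, while lowering it favors feedback that matches the teacher's wording; in practice the balance would be tuned on held-out data.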
Our preliminary evaluation results show that our multimodal LLMs, enhanced with semantic data augmentation (e.g., using self-instruct) and image augmentation (e.g., super-resolution, cropping, object detection), can achieve over 78% accuracy on a five-label scoring task and ~45% alignment with teacher feedback on the generation task, outperforming the baselines ViLT and CoCa (SOTA in visual reasoning) by 3% and 4%, respectively. OpenAI’s GPT family (3.5 and 4.0) does not currently support multimodal inference, and we will examine these models further if their multimodal capabilities become publicly accessible. We are working on a paper, and the full model will be released with our publication.
Please feel free to contact us if you are interested in learning more about our work.
Note: The multitask version will be released soon, alongside our publication.