A recent study presented at the NeurIPS AI conference has highlighted significant limitations in the ability of large language models (LLMs) to handle complex historical inquiries. A team of researchers developed a benchmark named Hist-LLM to evaluate three leading LLMs—OpenAI's GPT-4, Meta's Llama, and Google's Gemini—on their proficiency in answering historical questions. The findings revealed that these models struggled with nuanced, PhD-level historical queries, raising questions about their reliability in this domain.
The research team found that while the LLMs could handle basic historical facts, they often fell short on more obscure or detailed historical information. For instance, when asked about ancient Egypt, one model incorrectly asserted that scale armor existed 1,500 years before it actually did. The researchers attributed such errors to the models' tendency to extrapolate from more prominent historical data points rather than retrieve the rarer fact itself, a tendency that produced significant inaccuracies.
Maria del Rio-Chanona, a co-author of the study, emphasized these limitations by stating:
"The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task."
The best-performing model in the study was OpenAI's GPT-4 Turbo, which achieved roughly 46% accuracy, an outcome only marginally better than random guessing and one that underscores how much these models struggle with complex historical questions. The research also identified biases in the training data of the OpenAI and Llama models, particularly on questions concerning regions such as sub-Saharan Africa.
The Hist-LLM benchmark evaluated the accuracy of the LLMs' answers against the Seshat Global History Databank, a comprehensive repository of historical knowledge named after the ancient Egyptian goddess of wisdom. Despite their advanced capabilities, the models had only limited success retrieving less well-known facts.
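To make the evaluation procedure concrete, here is a minimal sketch of how a benchmark of this kind might score a model against reference answers drawn from a databank. Everything in it (the ask_model stub, the question fields, the toy data) is a hypothetical illustration, not the actual Hist-LLM code or the Seshat schema.

```python
# Minimal sketch of a benchmark-style accuracy harness: pose multiple-choice
# questions to a model and score its answers against reference values taken
# from a historical databank. All identifiers below are hypothetical
# illustrations; this is not the actual Hist-LLM code or the Seshat schema.

def ask_model(prompt: str) -> str:
    # Stand-in for a real LLM API call; it always answers "A" so the
    # script runs end to end. Swap in a real client to test a model.
    return "A"

def score(questions: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for q in questions:
        prompt = f"{q['question']}\nAnswer with the letter of one option:"
        if ask_model(prompt).strip() == q["reference"]:
            correct += 1  # the reference answer comes from the databank
    return correct / len(questions)

# Toy example with two questions; a real benchmark would use thousands.
toy_questions = [
    {"question": "Was scale armor used in Old Kingdom Egypt? A) yes  B) no",
     "reference": "B"},
    {"question": "Were chariots used in New Kingdom Egypt? A) yes  B) no",
     "reference": "A"},
]
print(f"accuracy: {score(toy_questions):.0%}")  # prints: accuracy: 50%
```

The stub model always answers "A" so the script runs as written; in practice the stub would wrap a real API call, and accuracy would be averaged over thousands of databank-derived questions.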
Peter Turchin, the study's lead researcher, further commented on the findings:
"Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid in historical research."
However, Turchin cautioned that LLMs are not yet a substitute for human expertise in domains such as history. Del Rio-Chanona illustrated a key challenge these models face:
"If you get told A and B 100 times, and C 1 time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that."