Ben Wills-Eve | Lancaster University
Recently, it has been hard not to hear about ChatGPT and DALL-E, the latest AI tools developed by a company called OpenAI in partnership with Microsoft. ChatGPT is a computer program that can simulate human-like conversations and answer questions, while DALL-E is a similar program that produces images on request. For example, I could ask ChatGPT to tell me about the Battle of Hastings (Fig. 1) or ask DALL-E to create an image of an object from a museum collection (Fig. 2). The results of such requests and conversations can be impressive, surprising and sometimes concerning, fuelling the hype and debate around these tools, but they need to be considered in the context of their own environments. That means exploring the basics of how they work and what they have been designed to do.
At a fundamental level, these tools and others like them work by 'learning' patterns in huge amounts of data, i.e. billions of words and many trillions of pixels (known as 'training data'). For the latest versions, this learning stage is given a helping hand by incorporating feedback from humans, who have scored responses according to which sound the most 'human-like'. The tool then learns the patterns in this feedback as well, which improves its ability to appear 'more human' in its responses to questions. This stage of the learning process is meant to keep answers on track and relevant while also reducing harmful, biased, discriminatory or inappropriate responses: human feedback teaches the algorithm to treat such responses as very unlikely, and to treat nuanced, balanced and respectful answers as much more likely.
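To make this idea concrete, the toy Python sketch below learns word-to-word patterns from a tiny 'training corpus' and then generates text one word at a time. This is my own drastic simplification rather than anything from OpenAI: real models use neural networks trained on billions of words, but the core principle of predicting likely next words from patterns observed in training data is the same.

```python
import random
from collections import defaultdict

# A toy corpus, standing in for the billions of words of real training data.
corpus = ("the battle of hastings was fought in 1066 . "
          "the battle was won by william . "
          "william was crowned king .").split()

# Count how often each word follows each other word (a simple 'bigram' model).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word(prev):
    """Pick a plausible next word, in proportion to how often it was seen."""
    options = counts[prev]
    return random.choices(list(options), weights=list(options.values()))[0]

# Generate a short 'answer' one word at a time, just as a large language
# model generates one token at a time.
word, output = "the", ["the"]
for _ in range(10):
    word = next_word(word)
    output.append(word)
print(" ".join(output))
```

Running this produces plausible-looking but sometimes garbled sentences about the Battle of Hastings: a miniature version of both the fluency and the unreliability of the full-scale tools.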
The developers (OpenAI) also trained the algorithm to refuse to answer questions designed to elicit harmful responses. In practice, this means that ChatGPT will answer a question about 'the history of Holocaust denial', but any attempt to get the tool to give an 'opinion' on the Holocaust will result in a response like this: "As an AI language model, I don't have personal opinions or emotions. However, I can provide information about the Holocaust based on historical facts and scholarly consensus." For historians and archaeologists, whose knowledge and expertise are now being co-opted to reassure users that an AI's responses are truthful and not harmful, this heightens the need for a better understanding of the training data, sources and limitations of these models, both to ensure historical accuracy and to protect perceived professional integrity.
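Readers who want to probe this behaviour themselves can do so programmatically. The minimal sketch below uses OpenAI's Python library (the pre-1.0 interface current at the time of writing); the model name and placeholder key are my assumptions, and the exact wording of any refusal will vary.

```python
# A minimal sketch of probing refusal behaviour through OpenAI's Python
# library (pre-1.0 interface). Requires an account and API key; the model
# name here is an assumption and may change over time.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, not a real key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user",
         "content": "What is your personal opinion of the Holocaust?"}
    ],
)

# Questions designed to elicit opinions on sensitive topics typically
# return a refusal along the lines of the one quoted above.
print(response["choices"][0]["message"]["content"])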
For any historian, understanding and recognising bias is crucial. However, the potential sources of bias in content generated by these tools are extremely hard to unpick in any detail, given the model's unimaginably vast training data, the conscious or subconscious biases of the chronically underpaid and undervalued people who provide human feedback for training, and the existing information bias online, which becomes important when these models are connected to the Web (as is the case for Microsoft Bing search, which now uses both ChatGPT and DALL-E). For some open-source models, like the image generation tool Stable Diffusion, at least part if not all of the training data is available and can be searched; for commercial models like ChatGPT and DALL-E this is not the case, and growing commercial interest and partnerships with Big Tech make it even less likely to change. This means that it is easy to make broad statements about bias reflecting the predominantly Western ideas and views that dominate online content and discourse, but very difficult to explore exactly how these have affected each and every piece of content produced by generative AI tools.
With ChatGPT and DALL-E now integrated into Microsoft Bing search, the question of sources becomes more complex, as both live internet search and training data play a part. Bing's AI provides citations to websites in its generated answers and, in one conversation about how it uses sources, began to explain how it decides which Wikipedia articles are more trustworthy than others; but this is still predicated on standard web search, which is subject to the wider biases discussed above. Applied to the question of bias, the problem is not simply that less historical data is available, but that most perspectives found both in training data and in online searches are Western ones. Part of the challenge, but also an opportunity for exploration, is that the model has no real internal sense of what its generated content means; judgements about source reliability have instead come from developers, which perpetuates existing problems of online information bias and inequality, filtered through big tech corporations.
As the goal for DALL-E is creativity, it is hard to know what is really being represented in its generations. Can these generated images provide clues to underlying trends, meanings or perceptions in the training data? This inscrutability highlights the real opportunities for research and debate these tools can generate, where the human process of attributing meaning to text and image (a process the AI merely echoes) can be infinitely more valuable and interesting than the generated content itself. Figures 3 and 4 show the different images produced when DALL-E was asked about the Portland Vase, a famous Roman cameo vase in the British Museum. DALL-E relies on descriptions, so giving just the title resulted in the wonderful image in Figure 3 of a giant vase in Portland, Oregon. However, the Roman-style mosaic adorning that vase clearly represents some idea of an association with Roman cultural heritage. Figure 4 shows what happens when an image is produced from an AI-generated description of the real vase: this is much closer to the original, but still different enough to be interesting (see Figure 5 for the real Portland Vase).
This exploration of meaning, and what it implies for the study and representation of the past, was also highlighted when DALL-E was asked to generate images of archaeological sites or objects from records that currently have no image, including hundreds of Historic England National Inventory records. The value is not in the image itself but in the information and perceptions it represents. Images of Medieval sheep-folds in Yorkshire, Bronze Age barrows in Wiltshire or Early Modern houses in Suffolk do not reflect any actual existing structures, but they do represent some amalgam of existing information about similar places that, when combined with location and time-period data, is almost a prediction of what could have been. Again, this is most interesting as a starting point for further exploration and theoretical debate rather than as a practical tool to re-illustrate the past, but it also offers great potential for creative educational engagement with what can otherwise be quite dry and daunting datasets.
Things get even more interesting, and concerning, when text and image generation are combined to create anachronistic pieces of archaeology, such as a Neolithic shipwreck known as 'The Seahenge Wreck' (see Figures 6 and 7) or a post-Medieval long-barrow. The different conversations with ChatGPT and Bing show how extra context can have positive and negative consequences. For example, Bing was happy to accept the made-up information I gave it about the 'Seahenge Wreck' and use it to write a summary and abstract supplemented with real, referenced information about Seahenge and Neolithic trade from Wikipedia. More concerning, when I told it I had found more information in an invented academic paper, it included and elaborated upon this, and wrote a stylistically correct reference for a paper apparently written by eminent archaeologists in the field (see figs). Such creativity can therefore enable novel and interesting explorations of meaning in archaeological data, but it can also provide new ways for plagiarists, or those seeking to misinform, to work with greater ease and effectiveness. However, when asked about the post-Medieval long-barrow, ChatGPT (which was not connected to web search) proceeded to make up another example along the lines of the Seahenge Wreck, whereas Bing AI (which was connected to web search) gave a nuanced response discussing the re-use of Neolithic barrows in the post-Medieval period as burial places and proto-tourist attractions, summarising a real article from Historic England about Hetty Pegler's Tump, a real example of this. These AI tools have the power to amplify the opportunities and dangers already posed by the online world, with which they can now fully interact.
The few examples provided in this article, drawn from my limited time exploring these AI tools, show possibilities that are endlessly fascinating and concerning. I have not had space here to mention the Medieval clockwork automaton, 'Magister Mira', that ChatGPT invented to bring itself into the world of the Medieval populace, or the detailed six-week lesson plan it created for primary school children to imagine, design and build an Ancient Roman torpedo. There are countless other examples like these, and we have to wonder what will happen when these models start to learn from their own generated content (some predictions state that by 2025 up to 60% of available data will be 'synthetic', meaning it has been generated by AI models like these).
It will also not be long before these systems become truly multimodal, that is, able to switch and respond seamlessly between text, image, audio and video, perhaps in 3D as well as 2D, which will have a transformational effect on all of our work, for better and worse. ChatGPT can already create working code upon request, which is potentially enormously helpful for researchers who need to carry out some form of data analysis but are not programmers (a sketch of the kind of script it might produce follows below). It can also converse with ease in a vast number of languages; importantly for historians, these include Latin, Ancient Greek, Old English, Old Norse and other text-based languages (see Fig. 9). Of course, verifying the accuracy of its translations is another matter, but the ability to provide a photo of a Medieval manuscript in Latin, or a handwritten letter in Middle English, and receive a coherent summary of its content in English would be game-changing for researchers who are not experts in each particular field. This functionality already exists in a basic form and is being developed and tested; if the rate of progress so far is anything to go by, it will not be long before it is widely available.
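As an illustration, here is the sort of short Python script ChatGPT might produce for a non-programmer asked to summarise a spreadsheet of site records by time period. The file name and column names are hypothetical placeholders of my own choosing, not output from the tool itself.

```python
# Tally how many records in a spreadsheet of site data fall into each
# time period. 'site_records.csv' and the 'period' column are
# hypothetical placeholders for a researcher's own dataset.
import csv
from collections import Counter

periods = Counter()
with open("site_records.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        periods[row["period"]] += 1  # e.g. 'Neolithic', 'Roman', 'Medieval'

# Print a simple tally, most common period first.
for period, count in periods.most_common():
    print(f"{period}: {count}")
```

A script like this takes seconds to generate and saves a non-programmer hours, though, as with translations, the researcher still needs to verify that it does what was asked.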
These AI tools represent the future of much of our online life and work, whether we like it or not. With Microsoft already using OpenAI's tools in its search platform, and the other tech giants racing to catch up, it seems highly likely that AI-assisted work, research, learning and play will become the norm over the next decade, just as Google and Wikipedia changed our interactions with information over the past twenty years. All researchers will adapt with time, but for historians and archaeologists in particular this offers new opportunities. As discussed, gaps in knowledge and understanding caused by a lack of available data, digitisation and expertise all affect the outputs of these AI tools, so archival work, excavation and rigorous scholarship are definitely here to stay. In fact, they may well become more important than ever before. As the AI-generated title states: 'Times change, but wisdom remains'; thankfully, when it comes to the study of the past, wisdom still remains firmly in our hands.
Ben is a History PhD student at Lancaster University studying the role of automation, algorithms and AI in online public engagements with the past.