By Claire Newing, The National Archives (UK) and Phil Clegg, MirrorWeb.
At The National Archives of the UK we have been archiving the online presence of UK central government since 2003. We originally worked with suppliers to capture traditional websites which we made available for all to browse and search through the UK Government Web Archive: http://www.nationalarchives.gov.uk/webarchive/.
In the early 2010s, we recognised that the government was increasingly using social media platforms to communicate with the public. At that time Twitter, YouTube and Flickr were the most widely used platforms. We experimented with capturing channels using our standard Heritrix-based process but had very limited success, so we embarked on a project with Internet Memory Foundation (IMF), our then web archiving service provider, to develop a custom solution.
The project was partially successful. We developed a method of capturing Tweets and YouTube videos and the associated metadata directly from APIs and providing access to them through a custom interface. We also developed a method of crawling all shortlinks in Tweets so the links resolved correctly in the archive.
Unfortunately, we were unable to find a way of capturing Flickr content. The UK Government Social Media Archive was launched in 2014. We continued to capture a small number of channels regularly until mid-July 2017 when we started working with a new supplier, MirrorWeb.
Captures in the cloud
MirrorWeb social media capture is undertaken using serverless functions running in the cloud. These functions authenticate with the social media platform's API and request the metadata content for all new posts created on the social channel since the last capture point. Each post is then stored in a database for later replay. Further serverless functions are triggered when a post object is written to the database to check whether the post contains media content such as images or videos. If media content is found, it is added to a queue, which in turn triggers another serverless function to download the media for the post. For replay, the stored JSON objects are read back from the database and presented to the user with media objects in a similar layout to the original platform.
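The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MirrorWeb's implementation: the database is a plain dictionary, the queue is a deque, and the names capture_channel, on_post_written, fake_api and the media_urls field are all hypothetical.

```python
from collections import deque


def capture_channel(fetch_posts_since, database, media_queue, last_capture_id):
    """Store every post newer than last_capture_id and return the new capture point."""
    newest_id = last_capture_id
    for post in fetch_posts_since(last_capture_id):
        database[post["id"]] = post         # stored as-is for later replay
        on_post_written(post, media_queue)  # stands in for the database trigger
        newest_id = max(newest_id, post["id"])
    return newest_id


def on_post_written(post, media_queue):
    """Queue any media attached to a newly stored post for download."""
    for url in post.get("media_urls", []):
        media_queue.append((post["id"], url))


# Hypothetical stand-in for an authenticated platform API call.
def fake_api(since_id):
    posts = [
        {"id": 1, "text": "Old post"},
        {"id": 2, "text": "New post", "media_urls": ["https://example.org/a.jpg"]},
        {"id": 3, "text": "Newest post"},
    ]
    return [p for p in posts if p["id"] > since_id]


database, media_queue = {}, deque()
capture_point = capture_channel(fake_api, database, media_queue, last_capture_id=1)
```

In a real deployment each function would run separately and be triggered by the database or the queue; calling on_post_written directly here simply keeps the sketch self-contained.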
MirrorWeb chose to archive most social accounts daily to ensure that all new content is captured. Twitter, for example, limits the number of requests to its API in a 15-minute window and restricts the number of historic posts that can be collected, so some tuning is undertaken to ensure that rate limits are not exceeded and that all posts are captured.
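One common way of staying inside a windowed limit like this is to track recent request times and pause when the window is full. The sketch below is an assumption about how such tuning might look, not a description of MirrorWeb's code; the class name and the limit of two requests used in the example are hypothetical, and only the 15-minute window comes from the text.

```python
from collections import deque


class WindowRateLimiter:
    """Allow at most max_requests in any sliding window of window_seconds."""

    def __init__(self, max_requests, window_seconds=15 * 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._sent = deque()  # timestamps of requests still inside the window

    def wait_time(self, now):
        """Return how many seconds to wait before the next request is allowed."""
        # Forget requests that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # The window is full: wait until the oldest request expires.
        return self._sent[0] + self.window_seconds - now

    def record(self, now):
        """Note that a request was sent at time `now` (in seconds)."""
        self._sent.append(now)


limiter = WindowRateLimiter(max_requests=2)
limiter.record(0)
limiter.record(10)
delay_when_full = limiter.wait_time(20)
delay_after_window = limiter.wait_time(900)
```

Passing the current time in as an argument, rather than reading the clock inside the class, makes the behaviour easy to check without waiting for real time to pass.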
We also increased the number of channels we were capturing and took the opportunity to redesign our custom access pages, incorporating feedback from our users. Excitingly, images and video content embedded in Tweets were made accessible for the first time. The archive was formally re-launched in August 2018, but we always knew it could be even better!
Flickr, access, and full text search
Towards the end of 2018 we launched an improvement project with MirrorWeb. It had three key aims:
(1) To develop a method of capturing Flickr images. This became urgent as in November 2018 Flickr announced that as of early 2019 free account holders would be limited to 1000 images per account and any additional images above that number would be deleted. A survey showed that several UK government accounts, particularly older accounts which were no longer being updated, were at risk of losing some content.
(2) To further improve our custom access pages.
(3) To provide full text search across the social media collection.
The project aims were fulfilled, and the new functionality went live late in 2019.
By far the most exciting new development was the implementation of full text search. We undertook some user research earlier in 2018 which revealed that users considered the archive to be interesting but didn’t think it was very useful without search. This emphasised to us how important it was to provide such a service.
The search service was built by MirrorWeb using Elasticsearch, the same technology we use for the full text search facility on the UK Government Web Archive, our collection of archived websites. MirrorWeb once again make use of serverless functions to provide full text search of the social accounts. When social post metadata is written to the database, a serverless function is triggered to extract the relevant metadata and this is then added to Elasticsearch.
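The extraction step might look something like the following sketch. The field names (platform, channel, created_at, title, description, text) and the function name are assumptions made for illustration; the actual stored metadata and index mapping are not described here. The function only builds the document that would be sent to the search index and does not call Elasticsearch itself.

```python
def to_search_document(post):
    """Reduce a stored post to the fields a search index would need."""
    # Tweets carry their words in "text"; YouTube and Flickr items in
    # "title" and "description". Missing or empty fields are skipped.
    parts = (post.get("title"), post.get("description"), post.get("text"))
    return {
        "platform": post["platform"],
        "channel": post["channel"],
        # Assumes created_at is an ISO 8601 string such as 2019-03-01T10:00:00Z.
        "year": int(post["created_at"][:4]),
        "text": " ".join(part for part in parts if part),
    }


video = {
    "platform": "youtube",
    "channel": "example-channel",
    "created_at": "2019-03-01T10:00:00Z",
    "title": "Budget statement",
    "description": "Full video",
}
document = to_search_document(video)
```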
Each search queries the full text of Tweets and the descriptions and titles of YouTube and Flickr content. Users can initially search for keywords or a phrase and are then given the opportunity to filter the results by platform, channel and year of post. They can also choose to only display results which include or exclude specific words.
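A search of this kind maps naturally onto an Elasticsearch bool query, which combines must, filter and must_not clauses. The sketch below builds such a request body as a plain dictionary. It is an illustration only: the function name and the index field names (text, platform, channel, year) are assumptions, not the archive's real mapping.

```python
def build_search_body(query, phrase=False, platform=None, channel=None,
                      year=None, include_words=(), exclude_words=()):
    """Build an Elasticsearch bool query for a keyword or phrase search."""
    # A phrase search must match the words in order; a keyword search need not.
    must = [{"match_phrase" if phrase else "match": {"text": query}}]
    must += [{"match": {"text": word}} for word in include_words]

    # Exact-value filters narrow the results without affecting ranking.
    candidates = (("platform", platform), ("channel", channel), ("year", year))
    filters = [{"term": {field: value}}
               for field, value in candidates if value is not None]

    must_not = [{"match": {"text": word}} for word in exclude_words]
    return {"query": {"bool": {"must": must, "filter": filters,
                               "must_not": must_not}}}


body = build_search_body("climate change", phrase=True, platform="twitter",
                         year=2018, exclude_words=["budget"])
```

Filters that the user has not chosen are simply left out of the query, so the same function serves both the initial search and the later refinements.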
Additionally, we added a search box to the top of each of our custom access pages to enable users to search all data captured from a specific channel. For example, on the access page for the Prime Minister’s Office YouTube channel, a user can search the titles and descriptions of all the videos we’ve captured from that channel.
When we started to investigate capturing social media content over a decade ago, there was a feeling in some quarters that content posted on social media was ephemeral and that posts were only being used to point to content on traditional sites. Events of recent years have demonstrated that social media is of increasing importance. In some cases, government departments and ministers announce important information on social media some time before they update a traditional website.
We have achieved a lot already, but we know there is much more to do. In future, we aspire to add a unified search across our website and social media archives. We are aware that the API capture method does not work for all platforms, so we are actively working to find other methods of capture, particularly for Instagram and GitHub. We hope to find a way of displaying metadata we capture which is not currently surfaced on the access pages – for example changes to the channel thumbnail image over time.
We are also aware that there are many gaps in our web archive where we were unable to capture embedded YouTube videos. We hope to develop a method of linking between those gaps and the equivalent videos held in the YouTube archive. Finally, we plan to do some user research to guide future developments. We are very proud of the UK Government Social Media Archive and we want to make sure it is used to its full potential.