Quality Assurance in Web Archives: How to Automate Your Work with Command Line

By Kourosh Hassan Feissali, Web Archivist at The National Archives, UK

Both ‘automation’ and ‘command line’ can sound daunting to non-programmers. But in this post I’m going to describe how easy it is for non-programmers to take baby steps towards automating time-consuming tasks, or parts of a bigger task.

Why Use Command Line?

I admit that I’m new to the command line myself, but the more I use it the more I am amazed by its efficiency and by all the things it can do. Here are some of the reasons for using the command line:

  • You don’t need to be a programmer to write commands. Once you learn the basics you can just copy and paste commands.
  • A Command-Line Interface (CLI) is pre-installed on your computer for free!
  • You can copy useful commands from the Internet and simply paste them in your CLI.
  • You write a command for a task once and you use it as many times as you want without having to think about the steps involved. This will save a lot of time.
  • You can write one tiny command to automate step 1 of a bigger task with 10 steps. Then, if you want, you can add a second command to automate step 2. This means you don’t have to write an entire programme.
  • Spreadsheet applications and text editors often struggle with very large files, but this is not a problem in a CLI.

Case Study: Quality Assurance of Brexit Sites

At the UK Government Web Archive (UKGWA) we crawled a large number of Brexit-related websites. Due to the nature of the project it was essential to carry out enhanced quality assurance (QA) on these websites. One technique that we used was checking the logs of the web crawler. The problem was that crawl logs can have millions of lines, and very few applications on the market can easily handle such huge log files. Further, some of these applications only work on one operating system (OS), but we use several operating systems in the team.

To illustrate how simple it is to speed up a multi-stage task I’m going to break down our enhanced QA into smaller tasks here and use some basic commands to drastically speed up the process.

The Steps

  1. Download all the log files.
  2. Merge them into one.
  3. Sort the lines by server response code.
  4. Remove all the lines where the server response code is 404, begins with 2 or 3, or is blank.
  5. Save the remaining URLs, those with response codes such as 500 or 403, into a new file.
  6. Remove duplicate URLs and save in a new file.
  7. Check the remaining URLs against the live site.
  8. Ignore the ones that are broken on the live site.
  9. Copy the ones that work correctly on the live site and save as a patch-list.
  10. Clean up Downloads folder.

The real process is a little longer than this, but I’ve omitted some of the steps for the purposes of this blog post. As you can see, we’ve broken down one fairly complex job into 10 simple steps that are easy to understand and easy to tackle on their own. Some of these steps are quite simple, but with very large files they can freeze your computer if you use generic applications such as MS Excel. Here, I’ll describe how we can use the CLI for some of the above steps.

Step 2: cat *.log >> final.log

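Step 4: sed -i "" 's+^404.*++g' sorted.txt; sed -i "" 's+^[2-3].*++g' sorted.txt

(The quotes must be plain straight quotes, as typed on a keyboard. The empty "" after -i is the form used by sed on macOS; with GNU sed on Linux, use -i on its own.)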

Step 5: cp sorted.txt sorted_errors.txt

Step 6: sort -u sorted_errors.txt > sorted_errors_dedup.txt

Step 10: rm *.log
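
The commands above can be tried out safely on a small made-up file before pointing them at real crawl logs. The sketch below is only an illustration under stated assumptions: the folder name qa_demo, the sample URLs and the line layout (a response code, a space, then a URL) are all hypothetical, since the real log layout is not shown in this post. It also swaps in grep -v for the sed commands of Step 4, because those sed commands empty the matching lines but leave the blank lines behind, whereas grep -v drops them entirely.

```shell
#!/bin/sh
# Hypothetical sample data; assumed layout per line: "<response code> <URL>".
mkdir -p qa_demo

cat > qa_demo/merged.txt <<'EOF'
200 https://example.com/
404 https://example.com/missing
500 https://example.com/report.pdf
301 https://example.com/old
403 https://example.com/private
500 https://example.com/report.pdf

EOF

# Step 3: sort the lines numerically by the leading response code.
sort -n qa_demo/merged.txt > qa_demo/sorted.txt

# Steps 4 and 5: drop lines that start with 404, 2, 3 or white space,
# and empty lines, then save what is left in a new file.
grep -v -E '^(404|[23]|[[:space:]]|$)' qa_demo/sorted.txt > qa_demo/sorted_errors.txt

# Step 6: remove duplicate lines.
sort -u qa_demo/sorted_errors.txt > qa_demo/sorted_errors_dedup.txt

# Show the result: only the 403 and 500 lines remain, one copy of each.
cat qa_demo/sorted_errors_dedup.txt
```

Because every step writes to a new file, the original merged file is never modified, so a mistake in one step can be fixed by re-running that step alone.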

What’s great about the CLI is that you don’t have to learn a whole new language before seeing the result of your work. As you can see in Step 2 above, the ‘cat’ command concatenates multiple files into one, no matter how many files you have or how large they are. MS Excel can give you a really hard time with this simple step, but this one command concatenates every file whose name ends with ‘.log’ in the folder you run it from (here, the Downloads folder) into one file in the blink of an eye. Automating with the command line brings a lot of joy to your work life!
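
One thing to watch with the Step 2 command: its output file, final.log, also ends in ‘.log’, so on a second run it matches the *.log pattern itself, and >> appends to whatever is already in the file. A possible variation, shown here as a sketch with hypothetical file and folder names (merge_demo, crawl_a.log, crawl_b.log, merged.txt), writes to a name that does not end in ‘.log’ and overwrites rather than appends:

```shell
#!/bin/sh
# Hypothetical sample logs, created only so the example can be run anywhere.
mkdir -p merge_demo
printf '200 https://example.com/\n' > merge_demo/crawl_a.log
printf '500 https://example.com/report.pdf\n' > merge_demo/crawl_b.log

# Merge every .log file into a file whose name does NOT end in .log,
# using > (overwrite) instead of >> (append).
cat merge_demo/*.log > merge_demo/merged.txt

# Running the same command again gives the same file, not a doubled one.
cat merge_demo/*.log > merge_demo/merged.txt

# Count the lines in the merged file.
wc -l < merge_demo/merged.txt
```

With this naming, the clean-up in Step 10 (rm *.log) removes the downloaded logs but leaves the merged file in place.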


More on this topic:
How to automate web archiving quality assurance without a programmer
