RoboDoc

Monitoring UK government websites for their use of open and proprietary document formats.

Introduction

In 2014 the UK government announced that open standards would be used for sharing government documents.

Unfortunately, not all files on UK government websites are published in an open format yet, some are still published in a proprietary format. The RoboDoc project focuses on editable documents that should be in the open format called ODF. It aims to crawl the UK government websites and make a list of editable documents and record whether they are in an open or proprietary format. Files in the ODF format end in odt, ods or odp. Files in a proprietary format commonly end in doc, docx, xls and xlsx.

News

2018-03-25
I'm still plugging away at getting https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/686899/Country_returns_guide_-_March_2018.xls published in ODF. Here's my latest tweet:
@ukhomeoffice The spreadsheet on https://www.gov.uk/government/publications/country-returns-guide is in a proprietary format. Can you publish it in an open format instead please? Thanks! https://tlocke.github.io/robodoc/
I thought it would be good to thank people when they had published docs in an open format, rather than complaining all the time, so I tweeted this:
@ukhomeoffice @gdsteam Thanks for publishing the spreadsheet at https://www.gov.uk/government/publications/home-office-august-2017 in an open format! https://tlocke.github.io/robodoc/
2018-03-23

I did another run on 17th March, and the results are available for download. This time it looked at 80,000 pages and it showed that 21% of documents are in an open format. Here's a graph of the results broken down by year of publication:

2018-03-17

The latest run of the RoboDoc crawler tells us that of the documents published in 2018, only 12% were in an open format. Now that we've got some reasonable data, the next question is how to use it to increase that percentage. I tweeted about this, and here's the conversation:

@OpenForumEurope only 14% of editable docs on http://www.gov.uk are in an open format https://tlocke.github.io/robodoc/ not sure where to go from here


Maël Brunet @MaelB
Mar 6
Replying to @t_locke @OpenForumEurope

Cool project ! We did something a while ago at EU level http://fixmydocuments.eu/ the figures we found were even lower than yours in the UK.


Maël Brunet
@MaelB
Mar 6

Also unsure what can be done to improve. We tried a bit of name & shame + contacting the relevant services but there's just so many documents out there. Unfortunately I think this is seen as a very low priority.

So, I've decided to pick a document that was published in a proprietary in 2018: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/686899/Country_returns_guide_-_March_2018.xls and contact the government about that single document. On the page that links to this document, there's a link called 'Is there anything wrong with this page?'. Clicking on the link reveals a form with two fields. Here's what I submitted:

What were you doing?
Clicking on the link https://www.gov.uk/government/publications/country-returns-guide on the page https://www.gov.uk/government/publications/country-returns-guide.
What went wrong
The spreadsheet is in a proprietary format, but as I understand it, government policy is to publish spreadsheets in the open format ODF:

https://www.gov.uk/government/news/open-document-formats-selected-to-meet-user-needs

Thanks for your help,

Tony Locke
tlocke@tlocke.org.uk

I'll post an update if I get a reply.

Data

results_2018-03-17.ods
The run was performed on 80,000 pages of http://www.gov.uk/. It found 3,875 editable documents, only 21% of which are in an open format.
results_2018-02-27.ods
The run was performed on 40,000 pages of http://www.gov.uk/. It found 332 editable documents, only 14% of which are in an open format. Breaking it down by year of publication gives the following:
2012 2013 2014 2015 2016 2017 2018
0% 0% 2% 10% 12% 28% 12%
problems_2018-02-19.csv
The run was performed on 20,000 pages of http://www.gov.uk/. It lists 2342 files on the site that end in either .xls, .xlsx, .doc or .docx. The titles of the CSV are:
problems_2018-02-18.csv
The run was performed on 10,000 pages of http://www.gov.uk/. It lists all files on the site that end in either .xls, .xlsx, .doc or .docx. The titles of the CSV are:

The spreadsheets are licensed under a Creative Commons Attribution 4.0 International License.

Why Open Standards?

If a document is written in an open format, anyone can freely write editing and viewing software for the format. This means that users aren't beholden to a single vendor of software. In contrast, if a document is written in a proprietary format, then the user can be locked-in to using software from a single vendor.

For example, web pages are written in the open format HTML, which means there's a good choice of web browser (eg. Microsoft Internet Explorer, Mozilla Firefox, Apple Safari or Google Chrome).

A WikiBook on open standards has been written which includes a section on the benefits of open standards.

RoboDoc Software

The RoboDoc code is on GitHub at https://github.com/tlocke/robodoc. Suggestions, contributions or bug reports are very welcome. Please open a new issue on GitHub.