Development

Approaches to classification

The DCD archive has a very large number of artefacts (~500,000), relatively few of which (4,300) have been classified and catalogued in detail. 

How can we catalogue the rest of the archive as quickly, accurately, and completely as possible? 

I like to characterise the two extremes as vertical and horizontal:

  • vertical is where we have detailed, accurate data on a few items; 
  • horizontal is where we have little or no data about very large numbers of items.

Ideally we want the best of both worlds – lots of detail about lots of items – but that’s an expensive proposition, and we need to find a best-value approach to getting as much data as we can for the lowest cost, in terms of both money and resources (e.g. people’s time, physical space).

Picking your (vertical) battles

A big reason that the vertical approach is so slow is that the data entry requirements are often onerous and poorly presented. Few things are more off-putting than a screen filled with a hundred freeform entry fields you feel compelled to populate, and when input formats are not enforced the results are horribly error-prone – something that is readily apparent when browsing through the data we already have.

Complex, rigid taxonomies make searching and classification very efficient, but they are not people-friendly, so again it would be better if we could streamline and automate as much of this classification as possible to reduce the workload on curators.

It’s also a good idea to break up data entry – it’s not necessary to enter all the data relating to an artefact at once; it may be more efficient to enter a limited set of fields in a task-specific workflow.

Say, for example, you wanted to enter the physical dimensions of a set of paintings: it would be much easier to use a workflow focused on that task than one that also requires the date of production, the artist, and whatever other data the full record demands. Those fields can be completed in other workflows, by other processes or people, and it all moves in the same direction: raising quality across the archive.
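
By way of illustration, here is a minimal sketch of what such a task-focused entry step might look like. It is written in Python purely for concreteness; the artefact IDs, field names, and validation rules are hypothetical rather than anything DCD actually uses.

    # A minimal sketch of a task-focused entry step: only the dimension fields,
    # with the input format enforced up front. Field names and units are illustrative.
    from dataclasses import dataclass

    @dataclass
    class DimensionsEntry:
        artefact_id: str
        height_cm: float
        width_cm: float

        def __post_init__(self):
            # Enforce the format rather than accepting freeform text.
            if self.height_cm <= 0 or self.width_cm <= 0:
                raise ValueError("Dimensions must be positive numbers (centimetres).")

    def record_dimensions(records: dict, entry: DimensionsEntry) -> None:
        """Merge just these fields into the artefact's record; everything else
        is left for other workflows or other people."""
        record = records.setdefault(entry.artefact_id, {})
        record.update({"height_cm": entry.height_cm, "width_cm": entry.width_cm})

    # A curator works through a batch of paintings, entering only dimensions.
    records = {}
    record_dimensions(records, DimensionsEntry("DCD-000123", height_cm=61.0, width_cm=45.7))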

Any missing data can then be identified by appropriate searching, so we can find and measure what we haven’t got, not just what we have.
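
Following the same illustrative record layout, a simple query turns those gaps into a work queue:

    # Sketch: measure coverage of a field across the archive so that the gaps
    # themselves become a work queue. Record layout is illustrative only.
    def missing_field(records: dict, field: str) -> list[str]:
        """Return IDs of artefacts whose records lack a value for the given field."""
        return [aid for aid, rec in records.items() if rec.get(field) is None]

    records = {
        "DCD-000123": {"height_cm": 61.0, "width_cm": 45.7},
        "DCD-000124": {},  # dimensions not yet recorded
    }
    print(missing_field(records, "height_cm"))  # -> ['DCD-000124']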

Raising the (horizontal) bar

As we discussed elsewhere, machine learning offers some excellent ways to generate useful, reasonable metadata for very large numbers of items at very low cost. This is very much in the horizontal style, and it gives us a huge improvement across the database; the technology has only recently become practical, and its benefits and applications are enormous.

However, as that blog discusses, the data will almost inevitably suffer from being incomplete, inaccurate, and biased. That doesn’t negate its usefulness, though – it is still a huge step up from the “nothing” it replaces, and you don’t need to rely on people, time, or significant outlay to benefit from it.
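
As a rough sketch of how such a horizontal pass might be wired up – with suggest_tags standing in for whichever model is actually chosen (a zero-shot image classifier, say), and nothing here reflecting DCD’s real systems:

    # Sketch of the horizontal pass: run a model over every image and store its
    # suggestions as *candidate* metadata, clearly marked as machine-generated.
    # `suggest_tags` is a stand-in for the chosen model; its fixed return value
    # here is purely illustrative.
    def suggest_tags(image_path: str) -> list[tuple[str, float]]:
        """Return (tag, confidence) pairs for one image. Stand-in implementation."""
        return [("ballet", 0.91), ("stage photograph", 0.78)]

    def auto_tag(records: dict, image_paths: dict) -> None:
        for artefact_id, path in image_paths.items():
            # Stored apart from curated fields so machine-generated suggestions
            # are never mistaken for vetted data.
            records.setdefault(artefact_id, {})["candidate_tags"] = suggest_tags(path)

    records = {}
    auto_tag(records, {"DCD-000125": "images/dcd-000125.jpg"})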

Apply the vertical to the horizontal

So how can we leverage these amazing machine learning capabilities while retaining the level of quality we are looking for?

Steve Jobs described computers as being like “a bicycle for the mind”, in that we can use them to amplify our own abilities.

The people who are capable of understanding what individual images mean, and are thus capable of populating their metadata more accurately and completely, are also able to help build the training set for a model that can classify other images.

Time spent manually classifying a dance genre across a few hundred images may be enough to train a model to do the same to 50,000 other images, providing an enormous amplification of an individual’s abilities and knowledge. This process can be repeated across any number of facets of the knowledge we want to capture in the archive.
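
As an illustration of that amplification, and assuming image embeddings have already been extracted with some pretrained vision model, a simple classifier fitted to a few hundred labelled examples can then score the rest of the collection. The numbers and sizes below are made up:

    # Sketch of the amplification step: a few hundred manual genre labels train a
    # simple classifier over precomputed image embeddings, which is then applied
    # to the rest of the collection. Sizes and data are illustrative only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    labelled_embeddings = rng.normal(size=(300, 512))       # ~300 manually classified images
    labels = rng.integers(0, 4, size=300)                   # e.g. four dance genres
    unlabelled_embeddings = rng.normal(size=(50_000, 512))  # the rest of the archive

    clf = LogisticRegression(max_iter=2000).fit(labelled_embeddings, labels)
    predicted = clf.predict(unlabelled_embeddings)                 # candidate genres for 50,000 images
    confidence = clf.predict_proba(unlabelled_embeddings).max(axis=1)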

Some express concern about letting models loose on real data, but that can be addressed by keeping auto-generated data separate from manually entered or curated/vetted data, and by applying similar classification or crowdsourcing techniques to the auto-generated data.

For example, get a model to generate a selection of tags, and then get a human volunteer to vet the classifications before promoting them to “real” tag status.
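
A minimal sketch of that promotion step, again using the illustrative record layout from the earlier sketches:

    # Sketch of the vetting step: candidate (machine-generated) tags only become
    # "real" tags once a volunteer approves them; unapproved suggestions are dropped.
    def review_candidates(record: dict, approved: set[str]) -> None:
        """Promote approved candidate tags to curated tags; discard the rest."""
        candidates = record.pop("candidate_tags", [])
        promoted = [tag for tag, _score in candidates if tag in approved]
        record.setdefault("tags", []).extend(promoted)

    record = {"candidate_tags": [("ballet", 0.91), ("opera", 0.40)]}
    review_candidates(record, approved={"ballet"})
    # record is now {"tags": ["ballet"]} – the unapproved suggestion is discarded.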

This is exactly the kind of problem that a Zooniverse-type system would work well for.