Government Information and Data Re-use: Job Vacancies

Here in the GCIO we’re considering the question of government information and data re-use. We want to work together with suppliers of non-personal government information, opening the vaults, releasing the butterfly from its chrysalis. To keep ourselves firmly rooted in reality we’re looking at the problem through the lens of a real-world test case: job vacancies. To test our thinking we’ve been working with www.jobs.govt.nz and the good folks who manage the NZDF vacancies page. Your mileage may vary but we’re hoping that the questions raised during this exploration may help you or anyone else considering something similar.

To chase the best and brightest, New Zealand government agencies post their job vacancies to a variety of locations: to their own web sites; to various commercial job sites; and to the all-of-government job vacancy site at www.jobs.govt.nz. As luck would have it, each of these destinations supports a slightly different set of fields. Fields that appear to share the same name and function can turn out to support different controlled value lists.

Consequently, it’s an entirely manual act for an agency to publish a job vacancy beyond the realms of its own site. Each agency is blessed with staff who carefully translate from one set of fields to another, from one value list to another. Text is meticulously copied and pasted. When a job vacancy must be pulled ahead of its closing date, agency staff delete the posting from each and every destination.

This process is time-consuming and, despite the diligence of agency staff, error-prone. A job vacancy may not appear in all the locations that a job seeker might reasonably expect it to, and it may look different in each location where it does appear. Ideally [and I do mean ideally], an agency could maintain a job vacancy once, such that it would be maintained everywhere. This sounds like it might be a scenario for information and data re-use.

Can we apply information and data re-use in such a way that it uses agency resources and powers in a more efficient, appropriate and effective way? Can we use information and data re-use to transform the provision of services for job hunting New Zealanders? Can information and data re-use enhance access, responsiveness and effectiveness, improving job hunting New Zealanders’ experience of State Services? Perhaps so. [If you recognise these questions then I applaud your acquaintanceship with our development goals!]

We’re not going to examine the business case for information and data re-use any further here. We’re going to leave aside non-technical, but non-trivial considerations such as licensing, copyright, privacy, data sovereignty or the gathering of usage statistics. We’re going to ignore some of the technical issues too, such as development capability, server capacity, service reliability or availability of bandwidth. Instead, we’re going to look at the following interrelated areas:

  1. Data and Information. How might we describe what it is that we are trying to communicate?
  2. Translation. How might information be translated from our schema to/from another related schema?
  3. Representation. How might we represent our information in the real world?
  4. Interaction. How might other parties interact with our information?

Data and Information

How might we describe what it is that we are trying to communicate? How do we choose the data, structures and controlled value lists to describe our information? In the case of a job vacancy our data might include job title, remuneration, closing date etc. Data without structure is just that: data. If we add structure, we get information. So, we must also choose a schema that captures the relationships between these items of data.

[I don't know about you, but I find it difficult to talk about a schema without representing it somehow. At this point however I'm trying to look at the problem of schema divested of representational form. A discussion of representation will come later.]
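For the sake of something concrete to point at, here is a minimal sketch of a job vacancy as data plus structure, written in Python. The field names, the category list and the closing date are our own assumptions for illustration; they are not taken from jobs.govt.nz or any other published schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical controlled value list, invented for this sketch.
JOB_CATEGORIES = {"Accounting & Finance", "Defence", "Information Technology"}


@dataclass(frozen=True)
class JobVacancy:
    """One job vacancy. Field names are illustrative assumptions."""
    title: str
    category: str                       # must be one of JOB_CATEGORIES
    location: str
    closing_date: date
    remuneration: Optional[str] = None  # optional: not every advert states pay

    def __post_init__(self):
        # Enforce the controlled value list when the record is created.
        if self.category not in JOB_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


# Example record; the closing date is made up.
vacancy = JobVacancy(
    title="Armourer",
    category="Defence",
    location="Linton",
    closing_date=date(2009, 2, 27),
)
```

The dataclass carries the structure (which fields exist, which are optional) while the set carries the controlled value list. Swapping in a different schema means changing both.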

For a government agency looking to structure data, e-GIF is a good first port of call. Unfortunately, there is currently no e-GIF schema that fully describes New Zealand Government job vacancies. We might lobby for an official e-GIF schema for the representation of New Zealand government jobs. However, a new schema takes time and effort to devise. It must be agreed on by all the relevant parties, and then widely supported to become useful. Once defined it must be maintained.

We might try and make use of an existing standard. HR-XML may be used to describe job vacancies. There’s also a nascent attempt from the microformat community [and, as microformats are schemas by convention this is an example of a schema that is very hard to separate from its representation!].

We might consider making use of a schema defined by a commercial vendor. Google Base and Yahoo! DataRSS define schemas [I can't bring myself to type “schemata”] with many of the necessary elements for use in describing job vacancies. Both offer advantages for integration with the respective search engines, and either may be a good horse to back if widespread adoption is desirable.

Seek and TradeMe, two of the larger players in the New Zealand job market, both publish schema documents describing the schemas of their respective systems. The two schemas differ significantly but both are tailored to the New Zealand environment.

We might look into similar efforts by other governments around the world. In our case, it appears the UK government is looking into the use of Yahoo! DataRSS to represent job vacancies. Similarly, Tyne and Wear sought to “standardise the way councils and other public bodies communicate job vacancy information”. It may prove instructive to establish a communications channel with such organisations.

Finally, we might consider a de facto government schema as defined by jobs.govt.nz, the existing all-of-government job vacancy site. This may be a pragmatic approach given the cabinet minute requiring agencies to advertise all future vacancies seeking external applications on jobs.govt.nz.

It’s worth noting that, with no strict standard to follow, agencies currently publish their job vacancy descriptions using schemas and value lists of their own devising. The NZDF controlled value list for job title includes “Armourer” and “Munitions Expert”. The NZDF expresses a job’s location in terms of a military base (e.g. “Ohakea” or “Linton”).

Checklist: Decide on the data required to sufficiently detail what you are trying to communicate. Choose a schema that sufficiently captures the relationships between the data.

Translation

How might information be translated from our schema to/from another related schema? In our test case, agencies want to publish job vacancies to other parties such as individual job seekers or other job sites. Jobs.govt.nz wants to aggregate job vacancies from agencies. As we have seen, both parties publish job vacancies according to different schemas. Fields may be structured differently. Fields that are optional in one schema may be required in another. Fields may be validated against different controlled value lists.

An example of structural translation occurs to/from Google Base’s representation of location. Other job vacancy schemas separate location into country, region, city/town, street and postcode fields. A Google Base job vacancy represents a location as a free text field, relying on the Google Maps API to parse and resolve the text to a geo-location.

An example of value translation occurs to/from the jobs.govt.nz schema’s representation of job categories. The jobs.govt.nz schema encapsulates all finance and accounting related job categories within a single “Accounting & Finance” category. Other schemas permit subcategories such as “Accounting => Assistant”.

In both types of translation a machine can reasonably map fine-grained values to coarse-grained values (e.g. “Accounting => Assistant” to “Accounting & Finance” or “100 Molesworth Street, Wellington, New Zealand” to “Wellington”), but it cannot reasonably map coarse-grained values to fine-grained (“Accounting & Finance” to “Accounting => Assistant” or “Wellington” to “100 Molesworth Street, Wellington, New Zealand”). As a result, information is lost in translation.
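The asymmetry can be sketched in a few lines of Python. The mapping table below is hypothetical: only “Accounting => Assistant” and “Accounting & Finance” come from the schemas discussed above, and the other two subcategories are invented to make the point.

```python
from typing import Dict, List

# Hypothetical fine-to-coarse mapping. Only the first entry reflects values
# mentioned in the text; the others are made up for illustration.
FINE_TO_COARSE: Dict[str, str] = {
    "Accounting => Assistant": "Accounting & Finance",
    "Accounting => Management": "Accounting & Finance",
    "Finance => Analyst": "Accounting & Finance",
}


def to_coarse(fine_category: str) -> str:
    """Fine to coarse: each fine value has exactly one coarse parent."""
    return FINE_TO_COARSE[fine_category]


def to_fine_candidates(coarse_category: str) -> List[str]:
    """Coarse to fine: all we can offer is the list of possible originals."""
    return sorted(
        fine for fine, coarse in FINE_TO_COARSE.items()
        if coarse == coarse_category
    )


print(to_coarse("Accounting => Assistant"))
print(to_fine_candidates("Accounting & Finance"))
```

Going one way yields a single answer. Going the other way yields three candidates and no basis for choosing between them, which is the lost information.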

To avoid loss of information in value translation both Seek and TradeMe attempt to create supersets of all known values. Maintaining a superset is a difficult proposition. Neither Seek nor TradeMe supports “Armourer” or “Munitions Expert” as job titles. Neither supports “Ohakea” or “Linton” as locations.

Checklist: Consider how your choice of data and schema relate to the choices of other parties that you hope to interact with. If necessary, revisit your choice of data and schema!

Representation

How might we represent our information in the real world? Who consumes our information? How much of our data and schema must we communicate in order to be understood? Are the interested parties human, machine or both?

When representing information for human consumption we might prioritise appearance and readability, preferring natural language and minimising redundancy to avoid reader fatigue. For humans it may be enough to represent our job vacancy in pure text form or X/HTML. For convenience we might encapsulate our information within a feed format such as Atom or RSS.

When representing information for machine consumption we might prioritise structural regularity to aid machine parsing, perhaps at the expense of redundancy. Alternatively, we might prioritise compactness where we know we’ll be dealing with large amounts of information. For machines we might consider representing information in XML, JSON or CSV.
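To illustrate, here is a rough sketch that renders one vacancy in each of those three machine-oriented forms using only the Python standard library. The field names and values are assumptions of ours, not part of any agreed schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Illustrative record; field names and the closing date are made up.
vacancy = {"title": "Armourer", "location": "Linton", "closing_date": "2009-02-27"}

# JSON: compact, and easy for most languages to parse.
as_json = json.dumps(vacancy, sort_keys=True)

# CSV: one header row, one data row. Flat records only.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(vacancy))
writer.writeheader()
writer.writerow(vacancy)
as_csv = buffer.getvalue()

# XML: one child element per field.
root = ET.Element("vacancy")
for name, value in vacancy.items():
    ET.SubElement(root, name).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_csv)
print(as_xml)
```

The information is identical in all three; only the representation differs. CSV is the most compact but would struggle as soon as a field needed nested structure, such as a location split into region and city.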

We might also consider a third way where we attempt to present our content in a way that satisfies both humans and machines. Through RDFa and microformats, semantic value intended for easy machine parsing may be added, invisibly, to existing human readable HTML content. [OK, in the strictest sense XML is human readable, but I doubt my kids will be reading BibleML out loud at the local Sunday School anytime soon!].

Checklist: Identify your consumer: machine, human or both? Will a single representation of your information suit everyone, or do you need to consider multiple representations?

Interaction

How might other parties interact with our information? To communicate with our human audience we might simply advertise our job vacancies on a web site. When a user navigates to a particular job vacancy, our web server will send a copy of the web page across the internet to the user’s browser. To the user it will appear that their request was met with an immediate response.

We might offer a subscription service to inform users of new job vacancies, perhaps via Atom or RSS feeds [NZDF does exactly that]. To the user it will appear as if our server has sent a copy of the job vacancy directly to their feed reader. This is of course a sleight of hand. Behind the scenes the feed reader is periodically requesting the latest copy of our feed, alerting the user to any new content it finds.

There are a few things to note here:

  1. In all cases the client (i.e. the user) initiates the transaction.
  2. In all cases the client receives a copy of the job vacancy. At any time, this copy may be superseded by changes to the authoritative source.
  3. The time lag between the server despatching a copy of the authoritative source, and the client reading the copy, may vary from seconds to hours to days.

The user will be unaware of changes to the authoritative source until the next time they visit the web page, or poll the feed.

This is equally true of machine to machine communication. A search engine may take minutes, hours or days to reflect changes made to our web site, depending on the frequency of visits by the search engine’s indexer. Information obtained from any feed we create will only be as current as the last time a copy of the feed was requested. In both cases the frequency with which the client updates itself is not under our control.

So, how much of a lag is acceptable? This might depend on business drivers associated with the type of information we are dealing with. In our test case, agencies sometimes need to withdraw a job vacancy ahead of its closing date. How long might the withdrawal of a job vacancy take to propagate to secondary sources? What complications might arise from the time lag?
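One way to picture the withdrawal problem is to compare two successive polls of a feed. The sketch below assumes, purely for illustration, that each poll gives us a mapping of vacancy identifier to title; the identifiers and the function name are hypothetical.

```python
from typing import Dict, Set, Tuple


def diff_snapshots(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Compare two polls of a feed, each mapping vacancy id to title.

    Returns (new_ids, withdrawn_ids).
    """
    new_ids = set(current) - set(previous)
    withdrawn_ids = set(previous) - set(current)
    return new_ids, withdrawn_ids


# Two successive polls by the same client. Identifiers are made up.
first_poll = {"v1": "Armourer", "v2": "Munitions Expert"}
second_poll = {"v2": "Munitions Expert", "v3": "Accountant"}

new_ids, withdrawn_ids = diff_snapshots(first_poll, second_poll)
print("new:", new_ids)
print("withdrawn:", withdrawn_ids)
```

The client only learns that `v1` has gone when it makes the second poll. However promptly the agency withdrew the vacancy, the secondary source keeps showing it for up to one full polling interval, and that interval is set by the client.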

There are other points to consider with machine-to-machine interaction too. Humans are relatively tolerant of error; machines are not so forgiving. If a machine needs to ‘understand’ data then that data must be correct. Imagine reading a job vacancy advertising the position of “Accontant”. We might gripe [or laugh!] about the quality control but at least we can still infer what the advertiser really meant. On the other hand, a machine programmed to validate “Accontant” against a finite list of possible job titles will fail to make any sense of the nonsense. [Unless the machine has been trained like Google Search to ask “Did you mean 'Accountant'?”!] What should happen when one machine cannot understand another?
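As a small illustration of the two behaviours, the sketch below validates a job title against a controlled value list and, on failure, offers a “Did you mean” suggestion using `difflib` from the Python standard library. The title list is hypothetical, and a real system would still need to decide whether to reject, correct or flag the record.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical controlled value list of job titles.
JOB_TITLES: List[str] = ["Accountant", "Armourer", "Munitions Expert"]


def check_title(title: str) -> Tuple[bool, Optional[str]]:
    """Return (is_valid, suggestion).

    suggestion is the closest known title when the input is invalid but
    similar to a known value, otherwise None.
    """
    if title in JOB_TITLES:
        return True, None
    # get_close_matches returns up to n candidates above a similarity cutoff.
    matches = difflib.get_close_matches(title, JOB_TITLES, n=1)
    return False, (matches[0] if matches else None)


print(check_title("Armourer"))
print(check_title("Accontant"))
```

A strict validator stops at the first element of the result. A more forgiving one might use the suggestion, at the risk of silently “correcting” a value that was never a typo.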

Checklist: Identify how consumers will interact with your information. Decide the location of your authoritative source. Identify how much latency is permissible between changes to your authoritative source and the reflection of these changes in secondary sources. Identify situations where failures to communicate may have consequences. Decide how failures should be dealt with.

We think that these questions have applicability to all government information and data re-use, not just to job vacancies. If you want to see your information and data take flight then we hope that this post might give you some food for thought. Perhaps you have comments or related questions that you’d be willing to share with us? We promise to reveal answers, as applied to our job vacancy test case, in a subsequent post. Thanks for reading!


Acknowledgements: I must commend the work of our summer intern Brad Taylor in cataloguing the relationships between so many of the various extant job vacancy schemas and controlled value lists. This post owes much of its existence to Brad’s hard work and searching mind. However, any errors, omissions or frailties of thought are mine.

The images of butterflies are from Europas bekannteste Schmetterlinge. Beschreibung der wichtigsten Arten und Anleitung zur Kenntnis und zum Sammeln der Schmetterlinge und Raupen (ca. 1895) by Dr. F. Nemos. These images are in the public domain, obtained from Wikimedia Commons.



One Comment

  1. Hi Mark,

    I’m ex-RNZAF, one of the founders of ImpelHR (online recruitment management) and now work for NZ’s largest Internet company - and read your post with interest. Drop me a line if you’re interested in swapping the odd email.

    Alan

    Alan
    Posted January 21, 2009 at 7:09 pm | Permalink
