The CACHE tool will allow researchers to search simultaneously across the plain text of documents and the metadata of artefacts related to Australian Indigenous history and ethnography.
The digital repository takes inspiration from other digital corpora of contact history and ethnography, particularly the University of Michigan’s The United States and its Territories, 1870–1925: The Age of Imperialism, and the South Asia Research Foundation’s South Asia Archive. Users will be able to search multiple documents simultaneously, constraining the searches by date, region or language group. One suggestion has been to use the open-source Lucene search-engine library or the University of Chicago’s Philologic as a basis.
Ultimately, the design of the repository and its interface must be driven by the kinds of research questions the users are pursuing. My own linguistic research has informed my thinking about how to build the tool and what I want to get out of it. These are some of the questions I would like to be able to address with CACHE:
- What are the etymologies of Aboriginal words in Australian English?
- How have Aboriginal social category systems been described and what changes have they undergone over the period of contact?
- What is the context for lexical innovation in Aboriginal languages, particularly with regard to naming introduced peoples, products and species?
- What kinds of interactions did Aboriginal people have with non-English-speaking foreigners prior to the 20th century?
Others would like to use the tool to address these questions:
- What are the etymologies of Aboriginal words in scientific names for Australian species? (David Nash)
Please let me know what research questions you are pursuing and your specific needs for the tool. Note that language-related questions reflect my particular research interests. The tool would be potentially even more useful for these kinds of questions:
- What were the circumstances of settler-Indigenous contact at a given time in a given area?
- To what extent can continuity of a given group be demonstrated from archival materials for the purposes of Native Title?
Scope
Funding is sought for a pilot version of CACHE, built in such a way that it can be easily expanded over time.
The pilot would consist of
- a data-entry and upload interface
- a search interface
- a search results interface
- a reading interface
As much as possible, open-source code will be used for all these functions. A programmer is needed to stitch these modules together, and a designer to create user-friendly interfaces.
IMPORTANT: If a single open-source product can do most of the functions of the four interfaces, this is preferred. I am willing to sacrifice a certain amount of functionality in favour of doing less coding, with fewer places for bugs to creep in.
The pilot version would only include documents from the period 1800 to 1805. There are fewer than 40 items for this period but they cover a very broad range of source types, including works that are not digitised and artefacts that are associated with texts, e.g., the artefacts collected from the Flinders expedition (1801-1803). Thus the search interface could be tested to its full capacity before CACHE is expanded and relocated to a more permanent home. The pilot version would require less than 1 GB of storage space.
1. Data-entry interface
All options will minimally require the following metadata fields to be available in the data-entry interface:
Source type (Dropdown: artefact, book, image, journal article, letter, newspaper or magazine article, police record, work of fiction)
Source identifier (location code or museum number etc in original archive; may be multiple)
Source URL (original http address of source; includes option ‘none’)
Language (Dutch, English, French, Italian, Russian, Spanish, none [artefacts])
Digitised (Dropdown: Y/N)
Date of publication/record (May add ‘ca.’)
Date of observation/event: start (In other words, the beginning of the relevant time period covered by the source. E.g., a publication of 1889 may concern events or observations from 1861-1863. May add ‘ca.’)
Date of observation/event: end
Author
Title
Place of publication
Pages (Indicates pages of relevance, especially in large volumes; may be left blank)
Region (Dropdown: ACT, NSW, NT, QLD, SA, TAS, VIC, WA, AUST; may be multiple. Obviously, state boundaries are understood according to their contemporary limits and sources relevant to the ACT may, for instance, pre-date the establishment of the ACT as a bounded region)
Coordinates: (multiple entry fields; this is for associating a source with geographic locations relevant to encounters/observations. Researching and entering this data for every source is potentially very labour intensive and is not expected to be included in the first phase of the project. Later stages may include a map interface where this information is displayed)
AUSTLANG/ISO 639-3 code: (Dropdown. Where this is explicit or inferrable, the languages of communities mentioned in the source are recorded in this field; should include ‘Unknown’ and a checkbox ‘language is uncertain’)
Comments: (A free field for commentary on the source)
Relevant texts: (A free field for listing relevant secondary materials that comment on the source)
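To make the field list above concrete, here is a minimal sketch of how one catalogue entry might be represented and checked at data-entry time. It is only an illustration: the class name, attribute names and the `problems` helper are hypothetical, the coordinates field is left out because it is not expected in the first phase, and the sample entry is invented around the 1889/1861-1863 example given above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Controlled vocabularies copied from the field list above.
SOURCE_TYPES = {
    "artefact", "book", "image", "journal article", "letter",
    "newspaper or magazine article", "police record", "work of fiction",
}
REGIONS = {"ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA", "AUST"}


@dataclass
class SourceRecord:
    """One catalogue entry; all attribute names here are hypothetical."""
    source_type: str
    title: str
    author: str
    digitised: bool
    regions: List[str] = field(default_factory=list)      # may be multiple
    identifiers: List[str] = field(default_factory=list)  # may be multiple
    source_url: Optional[str] = None    # None stands for the 'none' option
    language: Optional[str] = None      # None for artefacts
    published: Optional[int] = None     # year of publication/record
    published_circa: bool = False       # the 'ca.' flag
    observed_start: Optional[int] = None
    observed_end: Optional[int] = None
    place_of_publication: str = ""
    pages: str = ""                     # may be left blank
    language_codes: List[str] = field(default_factory=list)
    language_uncertain: bool = False
    comments: str = ""
    relevant_texts: str = ""

    def problems(self) -> List[str]:
        """Return a list of human-readable validation problems."""
        found = []
        if self.source_type not in SOURCE_TYPES:
            found.append(f"unknown source type: {self.source_type}")
        for region in self.regions:
            if region not in REGIONS:
                found.append(f"unknown region: {region}")
        if (self.observed_start is not None
                and self.observed_end is not None
                and self.observed_start > self.observed_end):
            found.append("observation period ends before it starts")
        return found


# Invented entry: published 1889, concerning observations from 1861-1863.
record = SourceRecord(
    source_type="book", title="Example title", author="Example author",
    digitised=True, regions=["NSW"], published=1889,
    observed_start=1861, observed_end=1863,
)
print(record.problems())  # prints []
```

Keeping the dropdown values in one place means the data-entry form and the search filters could draw on the same lists.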
2. Search interface
The search interface should be as simple as possible, with the option of doing an ‘advanced search’. I assume that Lucene has some kind of in-built search interface, and it’s possible that BookReader does too, but I don’t know how to interpret the source code. Any open-source module will be fine.
I quite like the Trove advanced search:
However, instead of ‘Newspaper Title and Location’, there would only be the option of ticking boxes for region, i.e., ACT, National, NSW, NT, QLD, TAS, VIC, WA. Likewise, ‘Article category’ would change to ‘Source category’ with the option of selecting: artefact, book, image, journal article, letter, newspaper or magazine article, police record, work of fiction. The default search is ‘all’.
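The tick-box behaviour described here could be sketched roughly as follows. This is an assumption-laden illustration rather than how Trove or Lucene actually does it: the function name `filter_sources`, the dictionary keys and the three sample entries are all hypothetical. Leaving a set of boxes unticked is treated as ‘all’.

```python
from typing import Dict, Iterable, List, Optional

# Invented sample entries; the keys are hypothetical.
SOURCES: List[Dict] = [
    {"title": "Example letter", "source_type": "letter", "regions": ["NSW"]},
    {"title": "Example artefact", "source_type": "artefact", "regions": ["NT", "QLD"]},
    {"title": "Example book", "source_type": "book", "regions": ["AUST"]},
]


def filter_sources(
    sources: Iterable[Dict],
    regions: Optional[Iterable[str]] = None,
    categories: Optional[Iterable[str]] = None,
) -> List[Dict]:
    """Keep sources that match the ticked regions and source categories.

    An empty or missing selection means 'all', which is the default search.
    A source with several regions matches if any one of them is ticked.
    """
    wanted_regions = set(regions or [])
    wanted_categories = set(categories or [])
    selected = []
    for source in sources:
        if wanted_regions and not wanted_regions & set(source["regions"]):
            continue
        if wanted_categories and source["source_type"] not in wanted_categories:
            continue
        selected.append(source)
    return selected


# Nothing ticked: every source is returned.
print(len(filter_sources(SOURCES)))
# Two regions ticked, one category ticked.
print(filter_sources(SOURCES, regions=["NSW", "QLD"], categories=["artefact"]))
```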
I also really like the British Museum’s advanced search feature for their collections database:
3. Search results interface
The model I have in mind is something like The United States and Its Territories: 1870–1925. What’s great about it is that it shows you the number of results of a given search term in each returned document. You can then choose to list just the pages with this term (‘Results details’) or look at the whole document starting from the title page (‘View first page’). I’ve never had any use for ‘List all pages’, which simply gives you a hypertexted list of page numbers, or ‘Add to bookbag’, which saves the source for you (better just to bookmark it yourself because the ‘bookbag’ is session-specific).
The image below shows the results of a basic search for the term ‘bolo’, a kind of Philippine machete used in agriculture and war.
In this image the results are sorted by frequency, but the default is to sort them by date of publication, which is more useful. For CACHE it would be even more useful to be able to sort the sources by ‘Date of observation/event’ as per the metadata above.
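As a rough sketch of the results behaviour described above, the snippet below counts how often a term occurs in each document and sorts the hits either by date of observation or by frequency. The function names, dictionary keys and the three tiny sample documents are invented for illustration; a real implementation would lean on whatever the chosen search engine provides.

```python
import re
from typing import Dict, List


def count_hits(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


def search(docs: List[Dict], term: str, sort_by: str = "observed_start") -> List[Dict]:
    """Return one result per matching document, with its hit count.

    sort_by="observed_start" orders by date of observation (earliest first);
    sort_by="hits" orders by frequency (most hits first).
    """
    results = []
    for doc in docs:
        hits = count_hits(doc["text"], term)
        if hits:
            results.append({
                "title": doc["title"],
                "observed_start": doc["observed_start"],
                "hits": hits,
            })
    if sort_by == "hits":
        results.sort(key=lambda r: r["hits"], reverse=True)
    else:
        results.sort(key=lambda r: r["observed_start"])
    return results


# Invented sample documents.
DOCS = [
    {"title": "A", "observed_start": 1803, "text": "A shield and another shield."},
    {"title": "B", "observed_start": 1801, "text": "A Shield was seen."},
    {"title": "C", "observed_start": 1802, "text": "Nothing relevant here."},
]

for result in search(DOCS, "shield"):
    print(result["observed_start"], result["title"], result["hits"])
```

Matching on whole words means a search for ‘shield’ would not count ‘shields’; whether that is desirable is a design decision for the pilot.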
4. Reading interface
The Internet Archive uses an open-source reading interface called BookReader. You can find the code here.
BookReader is not perfect and searching within a document can be unreliable, but thankfully you can read the associated plain text separately and do an accurate search just by using CTRL-F. Having said all that, the large double-page reading interface is the best I’ve seen. When it comes to viewing books and pamphlets, the two-page spread preserves the integrity of the original document which was designed to be read this way. The image below shows a search for ‘shield’ in the 1896 edition of Joseph Banks’ journal. The yellow pins indicate where in the document the word ‘shield’ appears and you can click on these to go straight there.
The pilot version of CACHE should use BookReader as the reading interface, but the graphics (arrow icons etc) ought to be changed to make it appear less Internet-Archivey.
Here are some examples of how the BookReader open-source interface has been implemented by others:
Clean, snappy and elegant. Lots of black space.
Princeton University Digital Library
With minimal changes to the default BookReader graphics.
Digital collections @ Stanford
Lots of white space.
Special thanks are due to John Carty, Bill Gammage, Gaye Sculthorpe, Jutta Besold, David Nash, Helen Gardner and Tom Honeyman for sharing sources and giving feedback.