Huygens/CLARIAH Geovistory pilot

lpetram
Huygens/CLARIAH Geovistory pilot

 

Short recap of what we discussed in Leipzig

  • We will work together on modeling and linking the data we selected for the Huygens/CLARIAH-Geovistory pilot, to be presented on 20 June 2019 in Amsterdam
  • The aim is not to do everything perfectly, but to show 1) the potential of Geovistory, and 2) the value of data modeling done properly 
  • We will join efforts with Pierre Vernus, as his data is broadly related to the Huygens data

 

Some (additional) explanation of the data

We at the Huygens Institute really like the way in which Geovistory works with a shared repository of entities, from which various projects can ‘pick and choose’ and thus create their own scope of the data. Within the three structured datasets we selected, there are persons (or ‘Actors’ in CIDOC-CRM-speak), ships and commodities (or ‘Things’), events and places. It would be great if we could model these basic entities (perhaps re-using models developed by George Bruseker; see below, under Action points) and go some way toward interconnecting them. A second major Geovistory feature we’d like to showcase is how it allows users to extract structured data from texts, and thus at the same time add structure to these texts. This is where the fourth collection (the letters compiled in and sent from the Dutch East Indies) comes into play.

 

Structured data, basics — grouped per entity

Let’s start with the ships:

  • We have a complete list of ships of the Dutch East India Company (1855 records), with info on yard that built the ship and tonnage. Dataset: Dutch-Asiatic Shipping (DAS). Ships have unique ids.

 

Persons:

Important note to start with: the data on persons in these sources are for the most part not yet disambiguated. We have observations of persons who were on board of a certain ship during a certain journey. We have multiple observations of the same persons, but most observations have not yet been resolved to individuals.

  • We have data on the persons on board of some of the journeys between Europe and Asia
    • Names of the masters on all voyages from Europe to Asia and back. Dataset: DAS.
    • Names, places of birth and details of employment of crew on a selection of outward and homebound voyages (c. 800,000 records). Links with ships established through voyage-id. Additional links for masters with data on masters from previous bullet through master-id. Dataset: VOC-opvarenden (VOCOPV).

 

Cargo:

  • We have data on type and value of cargo of a selection of 18th-c intercontinental and intra-Asiatic voyages. Not yet aligned with a commodities taxonomy. Links with voyages (and ships) established through voyage-id. Dataset: Bookkeeper-General (BGB).

 

Events:

  • We have data on the journeys the ships made between Europe and Asia (dates and places of departure / arrival available). Links with ships established through ‘shipid’. The voyages have their own unique ids. Dataset: DAS.
  • For the 18th c, we have additional data on some of these intercontinental journeys. We also have data on journeys within Asia (dates and places of departure / arrival available). Links with previous bullet established through voyage-id. Dataset: BGB.
  • We have work records for crew members on voyages from Europe to Asia (start and end dates of employment, rank). Dataset: VOCOPV.

 

Locations:

  • We have standardized and geo-referenced locations of yards where ships were built (reconciled to either GeoNames or Wikidata). Dataset: DAS.
  • We have standardized and geo-referenced locations of places of departure and arrival of voyages from Europe to Asia (reconciled to either GeoNames or Wikidata). Dataset: DAS.
  • We have locations of places of departure and arrival of intra-Asiatic voyages; as yet neither standardized nor geo-referenced. Dataset: BGB.
  • We have locations of places of birth of crew members (c. 800,000 observations; 150,000+ unique toponym attestations; c. 35,000 unique attestations / 640,000 observations have been standardized, geo-referenced and reconciled to either GeoNames or Wikidata). Dataset: VOCOPV.

 

Unstructured data

There are many references to the entities from the structured data in the fourth data collection, the Official letters of the VOC. We have as yet not established links between the mentions in the letters and the structured data; the NER-output in the hOCR-files was very preliminary. It would be great if we could show how Geovistory allows users to explore (already established) links between textual sources and records of entities and to establish new links.

 

 

Action and discussion points

  • Action: Contact George Bruseker for information on data modeling done within SeaLiT project (http://www.sealitproject.eu) and Swiss data aggregation project, where a number of core entities of historical datasets were defined (George developed data model under the conceptual umbrella of CIDOC CRM). (assigned to Lodewijk; e-mail sent on 11 April 2019; awaiting response).
  • Discussion: data format of structured data
    • I pointed you to the Druid triple store, where the structured data sources are available. It’s important to note that the data were converted into RDF within a very short time span and with a specific goal in mind (the data story on https://stories.datalegend.net/netwerk-maritieme-bronnen/). The structure is far from perfect and some values are missing.
    • My suggestion would be to start the ingest into the Geovistory backend from the original files: Excel / CSV for DAS and VOCOPV, MySQL for BGB. Given the scope of this pilot, I would suggest focusing on the basic properties of the main entities described above (the datasets contain additional data, which would be nice to have of course, but only if time permits).
    • What do you think about this?
  • Discussion: data format of text
    • I provided the Official letters as hOCR files. The boxes around the individual words point to their location on the scans of the book volumes in which the letters were published on our IIIF server (cf. images available through this url: https://beta.resources.huygens.knaw.nl/resourcesorb/?categories[0][0]=Generale%20missiven).
    • Is this workable for you, or would you prefer another format (plain txt, tiff images — you could perhaps load these directly from our IIIF-server)?
  • Discussion: linkage
    • As said before, NER-output of the Official letters is very rudimentary. However, since we have indices to these books, we could provide better NER (e.g. for a number of pages or, depending on the availability of specialists, one volume). We could also (manually) link a number of persons / ship’s names from the collections of entities with the Official letters. This would allow us to showcase the integration of text and structured entities in Geovistory. 
    • We have done quite a bit of work on resolving the person observations in VOCOPV to individuals, resulting in a set of c. 50,000 individuals who sailed to the East Indies more than once. We cannot yet publish this whole dataset (as it is part of an ongoing research project), but we could use a selection. This could then be used to showcase how Geovistory builds entities from observations/statements. We could also show how Geovistory allows users to find complementary observations. 
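
To make the hOCR point above a bit more concrete: a minimal sketch of how the word boxes could be read out of an hOCR file, using only the Python standard library. It assumes the files follow the usual hOCR convention of word spans with class ocrx_word and a title attribute that starts with "bbox x0 y0 x1 y1". The sample fragment and the names WordBoxParser and extract_word_boxes are made up for illustration; the actual files may be structured differently.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment, for illustration only.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 100 200 900 240'>
  <span class='ocrx_word' title='bbox 100 200 260 240; x_wconf 91'>Batavia</span>
  <span class='ocrx_word' title='bbox 280 200 420 240; x_wconf 88'>schip</span>
</span>
"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collects (word, (x0, y0, x1, y1)) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None
            self._in_word = True

    def handle_data(self, data):
        # Only text directly inside a word span is kept.
        if self._in_word and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False


def extract_word_boxes(hocr_text):
    parser = WordBoxParser()
    parser.feed(hocr_text)
    return parser.words


word_boxes = extract_word_boxes(SAMPLE_HOCR)
for word, bbox in word_boxes:
    print(word, bbox)
```

The pixel coordinates obtained this way are what would tie a mention in the text back to a region on the scan; how they map onto the images served by the IIIF server is not covered here.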
lpetram
Data explanation

As promised, I’ve made available the Huygens datasets in CSV format. You can access the data folder through this link (password sent by e-mail). In the following, I will give a short explanation of the files and their columns. This diagram shows how the files are interrelated:

 

 

File explanation:

  1. vocop: c. 800,000 employment records of VOC crew members, normalized, with references to DAS (see 2.) and vocPlaces (see 4.); also included: vocop_careers.csv: 7614 clusters of VOCOP records showing careers of disambiguated crew members (we have more clusters, but this is part of ongoing research)
  2. das: data on the voyages of VOC ships from the Netherlands to Asia and back, normalized, with a few references to 1. and 4.
  3. bgb: additional data on voyages of VOC ships, also voyages within Asia, normalized, with references to DAS (see 2.). 
  4. vocPlaces.csv: gazetteer for locations in 1-3, with GeoNames/Wikidata URIs and lat/long values

 

Collection: vocop

  1. vocop.csv
    1. ID: VOCOP id number (assigned by National Archives — the institution that created this data collection)
    2. fullName: full name
    3. firstName: first name
    4. patronymic: patronymic
    5. familyNamePrefix: family name prefix
    6. familyName: family name
    7. placeOrigin: place of birth (original)
    8. vocPlaceID: id of place of birth, refers to vocop_place.csv
    9. rank: id of rank, refers to vocop_rank.csv
    10. dateBeginService: start date of work relation
    11. dateEndServiceNoZeros: end date of work relation
    12. reasonEndService: reason why employment ended; 'Chamber [placename]' means the sailor was dismissed from service by the VOC branch mentioned after returning to the Netherlands
    13. endServiceWhere: location where employment ended; this is often the name of the ship on which the sailor returned to the Netherlands
    14. voyageID: DAS voyageID of outward voyage, refers to das_voyage.csv
    15. monthLetter: 'Ja' when the employee appointed a beneficiary who could collect 3 months' worth of pay (every year)
    16. debtLetter: 'Ja' when the employee received advance money on signing up with the VOC (often used to pay for clothes and other gear needed on board)
    17. generalRemark: general remarks
    18. boardedAtCape: 'Ja' when the record describes an employment that started at the Cape of Good Hope (this is slightly complex: employments that started at the Cape are often nested within employments that started in NL; a sailor could e.g. start service in Amsterdam, sail to the Cape on ship 1, and then change ships at the Cape to ship 2, which would then be described in a new record that has a 'Ja' in this field)
    19. boardedAtCapeVoyageID: if boarded at Cape: DAS voyageID of ship for leg Cape - East Indies, refers to das_voyage.csv
    20. DASvoyageReturnID: DAS voyageID of return voyage, refers to das_voyage.csv
    21. sourceReference: reference to physical pay ledger in National Archives
    22. scanPermalink: URI of scan of original pay ledger record
  2. vocop_place.csv
    1. VOC_placeID: id, refers to vocop.csv
    2. placeOrigin: toponym attestation as in source
    3. vocUniqueStandardizedToponymID: id of standardized toponym, refers to voc_places.csv
  3. vocop_rank.csv
    1. vocRankID: id, refers to vocop.csv
    2. rank: rank on board
    3. wage: minimum standard wage
    4. HISCO_CODE: HISCO category for rank
    5. HISCO_URI: HISCO URI (currently non-resolving; work in progress)
  4. vocop_careers.csv
    1. clusterID: ID of career cluster
    2. clusterRow: row number of record in cluster
    3. VOCOP_id: VOCOP id of record that forms part of cluster

 

 

Collection: das

  1. das_voyage.csv
    1. voyId: DAS voyage ID
    2. voyNumberDAS: original DAS voyage number
    3. shipNameVariantID: id of ship name, refers to das_shipNameVariant.csv
    4. voyMasterID: id for master (i.e. captain) of this voyage, refers to file das_master.csv
    5. voyMasterRemark: remark on master
    6. voyChamberID: id for VOC chamber (i.e. branch of company) that administered this voyage, refers to file das_chamber.csv
    7. voyDepartureEDTF: departure date of voyage
    8. voyDeparturePlaceID: id for departure place of voyage, refers to das_place.csv
    9. voyCapeArrivalEDTF: arrival date at Cape of Good Hope
    10. voyCapeArrivalEDTF_Remark: remark on arrival date at Cape of Good Hope
    11. voyCapeDepartureEDTF: departure date from Cape of Good Hope
    12. voyCapeDepartureEDTF_remark: remark on departure date from Cape of Good Hope
    13. voyArrivalDateEDTF: arrival date of voyage
    14. voyArrivalDateEDTF_remark: remark on arrival date of voyage
    15. voyArrivalPlaceID: id for arrival place of voyage, refers to das_place.csv
    16. voyInvoiceValue: value of goods transported
    17. voyChamber2ID: [don’t know? => irrelevant]
    18. voyParticulars: general remark on voyage
    19. voyCorrespondingNumber: [irrelevant]
    20. voyRGPDeel: reference to DAS book volume in which voyage was described
    21. voymaster_VOCOPVid: id of VOCOP record that describes employment of the master of this voyage
  2. das_ship.csv
    1. shipID: id number for ship
    2. voyTonnageMin: minimum tonnage of ship
    3. voyTonnageMax: maximum tonnage of ship (in case more than one tonnage in source)
    4. voyTypeOfShipID: id of ship type, refers to das_shipType.csv
    5. voyBuilt: [irrelevant]
    6. voyBuiltRemark: remark on acquisition of ownership
    7. voyBuiltY: year when ship was built / hired / bought
    8. voyYardYardID: id of yard where ship was built, refers to das_yard.csv
  3. das_shipNameVariant.csv
    1. shipNameVariantID: id number of ship name variant
    2. shipID: id number of ship, refers to das_ship.csv
    3. shipNameVariant: name variant of ship
    4. shipNameVariantRemark: remark on names of ship
  4. das_shipType.csv
    1. shipTypeID: id for ship type
    2. voyTypeOfShip: ship type
    3. voyTypeOfShipExternalID: external URI for ship type
  5. das_master.csv
    1. voyMasterID: id for masters (i.e. captains)
    2. voyMasterLastName: last name of master
    3. voyMasterFirstName: first name of master
    4. voyMasterFamilyNamePrefix: family name prefix of master
  6. das_onboard.csv
    1. onbId: id number for ‘onboard’ data
    2. onbVoyageId: voyage id, refers to das_voyage.csv
    3. onbCategory: crew category of below numbers (‘total’ means whole crew, not divided in categories)
    4. onbI: number of crew at departure
    5. onbII: number of deaths between Netherlands and Cape
    6. onbIII: number of crew that left ship at Cape
    7. onbIV: number of crew that boarded at Cape
    8. onbV: number of deaths during whole voyage
    9. onbVI: number of crew upon arrival in Asia
  7. das_yard.csv
    1. yardID: id number for yard
    2. yard: yard name
    3. yardLocatedIn_standardizedToponym: place where yard was located
    4. uniqueStandardizedToponymID: id for place, refers to voc_places.csv
  8. das_chamber.csv
    1. chamID: id number for chamber (i.e. VOC branch)
    2. chamber: chamber name
    3. chamberFullName: chamber full name
    4. chamberLocatedIn_UniekeToponiemenVOCPOPV: place where chamber was located
    5. uniqueStandardizedToponymID: id for place, refers to voc_places.csv
  9. das_place.csv
    1. placeID: id number for places mentioned in das_voyage.csv
    2. toponym_original: toponym as mentioned in DAS
    3. toponym_standardized: standardized toponym
    4. uniqueStandardizedToponymID: id for place, refers to voc_places.csv
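
Note that a voyage does not point at a ship directly: das_voyage.csv refers to a ship name variant, which in turn refers to the ship. A minimal sketch of that two-step lookup, using the column names above; the sample rows and the helper names index_by and ship_for_voyage are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; only the column names follow the file description.
VOYAGE_CSV = """voyId,shipNameVariantID
9001,11
9050,12
9999,77
"""

NAME_VARIANT_CSV = """shipNameVariantID,shipID,shipNameVariant
11,5,Zeelandia
12,5,Zelandia
"""

SHIP_CSV = """shipID,voyTonnageMin
5,600
"""


def load_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, key):
    """Build a lookup table from a key column to its row."""
    return {row[key]: row for row in rows}


def ship_for_voyage(voyage, variants, ships):
    """Follow voyage -> name variant -> ship; return None if a link is missing."""
    variant = variants.get(voyage["shipNameVariantID"])
    if variant is None:
        return None
    ship = ships.get(variant["shipID"])
    if ship is None:
        return None
    return {
        "voyId": voyage["voyId"],
        "shipID": ship["shipID"],
        "nameOnVoyage": variant["shipNameVariant"],
        "tonnageMin": ship["voyTonnageMin"],
    }


variants = index_by(load_rows(NAME_VARIANT_CSV), "shipNameVariantID")
ships = index_by(load_rows(SHIP_CSV), "shipID")
resolved = [ship_for_voyage(v, variants, ships) for v in load_rows(VOYAGE_CSV)]
for entry in resolved:
    print(entry)
```

In the sample, two voyages with differently spelled ship names resolve to the same shipID, which is the reason for the intermediate table.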

 

Collection: bgb

  1. bgb_cargo.csv
    1. carId: id number for cargo specification (each unit of cargo on board of ship during voyage gets an id number)
    2. carVoyageId: id number of voyage on which cargo was transported, refers to bgb_voyage
    3. carProductId: id number for product (i.e. type of cargo), refers to bgb_product
    4. carSpecificationId: id number for specification of cargo, refers to bgb_specification
    5. carUnit: id number for unit of account, refers to bgb_unit
    6. carQuantity: quantity (in units)
    7. carQuantityNumeric: quantity (metric value)
    8. carValue: total value of cargo
    9. carValueGuldens: value guilders
    10. carValueStuivers: value stuivers
    11. carValuePenningen: value penningen
    12. carValueLicht: total value of cargo in Indian money
    13. carValueLichtGuldens: value Indian guilders 
    14. carValueLichtStuivers: value Indian stuivers
    15. carValueLichtPenningen: value Indian penningen
    16. carRemarks: remarks on cargo
    17. carOrder: order in which cargo should be published (i.e.: 1 = first line)
    18. changed_when: provenance
    19. changed_by: provenance
    20. timestamp: provenance
    21. all_fields: ?
  2. bgb_place.csv
    1. id: id for place, refers to bgb_voyage.csv
    2. naam: toponym
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
    6. regio: id for region in which place was located, refers to bgb_regio
  3. bgb_product.csv
    1. id: id for product (i.e. type of cargo), refers to bgb_cargo
    2. naam: product name
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
  4. bgb_regio.csv
    1. id: id for region, refers to bgb_place
    2. naam: region toponym
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
  5. bgb_relVoyageShip.csv
    1. id: id number for voyage-ship relation
    2. voyId: id number of voyage, refers to bgb_voyage.csv
    3. shipId: id number for ship, refers to bgb_ship.csv
    4. timestamp: provenance
  6. bgb_ship.csv
    1. id: id number for ship, refers to bgb_relVoyageShip.csv
    2. naam: name of ship
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
  7. bgb_source.csv
    1. id: id number of source reference, refers to bgb_voyage.csv
    2. naam: inventory number of journal
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
  8. bgb_specification.csv
    1. id: id number of cargo specification, refers to bgb_cargo.csv
    2. naam: description of extra specification
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
  9. bgb_unit.csv
    1. id: id number of unit of measure, refers to bgb_cargo.csv
    2. naam: description of unit of measure
    3. added_when: provenance
    4. added_by: provenance
    5. timestamp: provenance
  10. bgb_voyage.csv
    1. voyId: id number of voyage, refers to bgb_cargo.csv and bgb_relVoyageShip.csv
    2. voyBookingDay: book date (day)
    3. voyBookingMonth: book date (month)
    4. voyBookingYear: book date (year)
    5. voyDeparturePlaceId: id for place of departure, refers to bgb_place.csv
    6. voyDepartureDay: departure date (day)
    7. voyDepartureMonth: departure date (month)
    8. voyDepartureYear: departure date (year)
    9. voyArrivalPlaceId: id for arrival place, refers to bgb_place.csv
    10. voyArrivalDay: arrival date (day)
    11. voyArrivalMonth: arrival date (month)
    12. voyArrivalYear: arrival date (year)
    13. voyInvoiceValue: value of cargo
    14. voyInvoiceValueLicht: value of cargo in Indian guilders
    15. voyRemarksForEditor: remarks
    16. voyDASNumber: corresponding DAS number of ship, refers to das_voyage.csv
    17. created_when: provenance
    18. created_by: provenance
    19. changed_when: provenance
    20. changed_by: provenance
    21. timestamp: provenance
    22. voySourceId: id of source reference, refers to bgb_source
    23. voynumber: id number of voyage [irrelevant, used on website]
    24. voyImage: [empty]
    25. voyRemarksForEndUser: remarks
    26. voyDepartureRegioId: id number for departure region, refers to bgb_regio
    27. voyArrivalRegioId: id number for arrival region, refers to bgb_regio
    28. voyFolioNummer: folio in source
    29. all_fields: [irrelevant]
    30. first_ship_name: name of first ship (if voyage concerns fleet)
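
The cargo values in bgb_cargo.csv are split over guilders, stuivers and penningen. A minimal sketch of how the three columns could be combined into a single amount, assuming the usual division of 1 guilder = 20 stuivers and 1 stuiver = 16 penningen. Treating empty cells as zero is an assumption on my side, and the function names are hypothetical. The sketch does not attempt any conversion between the regular values and the ‘Licht’ (Indian money) values.

```python
from decimal import Decimal

STUIVERS_PER_GUILDER = 20
PENNINGEN_PER_STUIVER = 16
PENNINGEN_PER_GUILDER = STUIVERS_PER_GUILDER * PENNINGEN_PER_STUIVER  # 320


def _to_int(cell):
    """Assumption: an empty or missing cell counts as zero."""
    cell = (cell or "").strip()
    return int(cell) if cell else 0


def to_penningen(guldens, stuivers, penningen):
    """Express an amount in the smallest unit (penningen)."""
    return (
        _to_int(guldens) * PENNINGEN_PER_GUILDER
        + _to_int(stuivers) * PENNINGEN_PER_STUIVER
        + _to_int(penningen)
    )


def to_decimal_guilders(guldens, stuivers, penningen):
    """Express an amount as a decimal number of guilders."""
    total = to_penningen(guldens, stuivers, penningen)
    return Decimal(total) / Decimal(PENNINGEN_PER_GUILDER)


# Hypothetical cargo row using the carValue* columns.
row = {"carValueGuldens": "10", "carValueStuivers": "5", "carValuePenningen": "8"}
amount = to_decimal_guilders(
    row["carValueGuldens"], row["carValueStuivers"], row["carValuePenningen"]
)
print(amount)
```

Working in penningen internally keeps sums over many cargo lines exact.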

 

File: voc_places.csv

  1. uniqueStandardizedToponymID: id number of standardized toponym
  2. uniqueStandardizedToponymCountryCode: standardized toponym, followed by country code
  3. URI: external URI for place
  4. LAT: lat value for place
  5. LNG: long value for place
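
To show how the gazetteer ties in: a minimal sketch that resolves the place of birth of a vocop record to coordinates, following vocPlaceID to vocop_place.csv and then vocUniqueStandardizedToponymID to voc_places.csv. The column names are those listed above; all sample values (including the example.org URI and the format of the toponym-plus-country-code field) are invented, and resolve_birthplace is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample rows; only the column names follow the file description.
VOCOP_CSV = """ID,fullName,placeOrigin,vocPlaceID
101,Jan Pietersz,Amsteldam,7
103,Klaas de Vries,Onbekend,
"""

VOCOP_PLACE_CSV = """VOC_placeID,placeOrigin,vocUniqueStandardizedToponymID
7,Amsteldam,42
"""

VOC_PLACES_CSV = """uniqueStandardizedToponymID,uniqueStandardizedToponymCountryCode,URI,LAT,LNG
42,Amsterdam NL,https://example.org/place/42,52.37,4.89
"""


def load_index(text, key):
    """Read CSV text and index the rows by the given key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def resolve_birthplace(record, attestations, gazetteer):
    """Follow record -> toponym attestation -> standardized place, or None."""
    attestation = attestations.get(record["vocPlaceID"])
    if attestation is None:
        return None
    place = gazetteer.get(attestation["vocUniqueStandardizedToponymID"])
    if place is None:
        return None
    return {
        "toponym": place["uniqueStandardizedToponymCountryCode"],
        "uri": place["URI"],
        "lat": float(place["LAT"]),
        "lng": float(place["LNG"]),
    }


records = load_index(VOCOP_CSV, "ID")
attestations = load_index(VOCOP_PLACE_CSV, "VOC_placeID")
gazetteer = load_index(VOC_PLACES_CSV, "uniqueStandardizedToponymID")

birthplaces = {
    vocop_id: resolve_birthplace(rec, attestations, gazetteer)
    for vocop_id, rec in records.items()
}
for vocop_id, place in birthplaces.items():
    print(vocop_id, place)
```

Records whose toponym has not been standardized yet come back as None, which matches the situation described earlier where only part of the attestations have been reconciled.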

 
