PAHMA data cleaning -- Materials

Overview

The objects in the TMS database often have a free-text description of the materials and techniques used in their construction. This data is stored in the Medium field of the Objects table. For clarity, this value will be called a "MatVal". See Table 1 below.

The PAHMA Materials Authority has 331 values (hereinafter ThesaurusMatVals) each of which has a refName. Table 2 shows a few items in this authority.

The challenge is to match the existing MatVals of Objects in PAHMA-TMS to this authority, assign one or more refNames to each ObjectID.

Table 1: Some selected MatVal values for TMS objects

TMS query:

select ObjectID, ObjectName, Medium from Objects;

Count

Value

560

Ceramic: Terracotta

10

Ceramic: White ware

1

Ceramic: YW

8

Ceramic: Yellow Terracotta

1

Ceramic: black

28

Ceramic: black-top red ware

35

Ceramic: black-top red ware, Polish

1

Ceramic: black-top red ware?, Polish

2

Ceramic: burnt

2

Ceramic: glazed porcelain

...

2

Ceramic; Repaired

11

Ceramic; Shell

15

Ceramic; Shell;Textile

4

Ceramic; Stone

4

Ceramic; Stone; Wood; Shell; Metal: Copper

...

1

Clay (half baked)

1

Clay, Charcoal, Tobacco

1

Clay, cotton fabric, plastic “leather”, wood, paint, yarn

2

Clay, glazed

2

Clay, glazed, painted

1

Clay, half-fired

1

Clay, paint, horse hair, fabric, beacs

1

Clay, paint, maple bark, beads

1

Clay, probably hand-modelled

1

Clay, wood, cactus, paint, feathers, buckskin, yarn

1

Clay, wood, wool, fabric, simulated leather, paint

Table 2: Selected rows from the Materials Authority index

(NB: these are from the dev.cspace materials authority, for demonstration only)

CSID

displayName

refName

8758f0cd-bdea-411a-b48c-652b0331b1eb

Baleen

urn:...:name(material_ca):item:name(mat1564794)'Baleen'

2b405dc3-735f-4bf8-99a0-fda6e18bb751

Balsa

urn:...:name(material_ca):item:name(mat1564792)'Balsa'

a5233fc6-559c-4a8c-bd03-f8847bf5ea69

Bamboo

urn:...:name(material_ca):item:name(mat1564764)'Bamboo'

02d8d815-8b81-4bc3-9dff-14aab9a5109f

Banana

urn:...:name(material_ca):item:name(mat1564796)'Banana'

4af9bd80-551b-4ba6-a603-eb8fec034fb7

Banana fiber

urn:...:name(material_ca):item:name(mat1564795)'Banana fiber'

5cc23005-b7c9-44e4-88e4-482cd537e362

Banana leaf

urn:...:name(material_ca):item:name(mat1564797)'Banana leaf'

Matching Algorithm

A two-step method is used:

  • If the string value of the MatField matches the string value of a ThesaurusMatVal, we are done. (NB: the match is case-insensitive; there are no identified cases where case-folding causes a problem.)
  • Since many MatFields specify more than one material, or add qualifications to the material specification, it is necessary to parse the non-exact-matching fields into chunks and attempt to match chunks and parts of chunks.

The first step is computationally trivial: a string match of case-folded values. However, looking at the unmatched residue, a significant number (almost 15%) could be classified by adding the following 24 values to the TMS Thesaurus. (NB: the ThesaurusMatKey values suggested are purely faculative; the actual values should be looked up)

Material or Medium

Taxonomy Key

wax cylinder

AUT.audio

faience

AUT.faience

feathers

AUT.feathers

film & glass plate

AUT.film

film

AUT.film

16mm

AUT.film

16mm work copy

AUT.film

16mm original

AUT.film

35 mm slide

AUT.film

16mm original outs

AUT.film

video

AUT.film

super 8

AUT.film

nut

AUT.nut

papyrus

AUT.papyrus

plate

AUT.photograph

albumen silver print

AUT.photograph

plate

AUT.photograph

shells

AUT.shell

ash, volcanic

AUT.stone

plaster

AUT.stone

fiber

AUT.textile

cane

AUT.wood

embrodiery

AUT.technique

lace

AUT.technique

The second slightly more complicated step entails taking apart the unmatched values and trying to match them (with near-perfect precison) to the taxonomy. We are willing to sacrifice some recall (i.e. leave some values unclassified) in order to ensure that the results that do come through are accurate. The parade examples of problem values are "gold, by Paul Smith" (which should be classified as "Metal: Gold"), and "Frank Ivory" (which should not be classified at all, since it represents the name of a person).

It remains to deal with the (51,868) cases where an exact match was not found, e.g. "film & glass plate", "ceramic: terracotta" or "stone, alabaster"). These MatVals were tokenized ("split") on a set of delimiters and the tokens were looked up in the TMS taxonomy. When more than one match was found, the most specific TMSMatKey was used. Several subcases exist: when there are 2 keys, and the second is more specific (i.e. one of the descendents of the first), the second key is kept ("2nd specific" in the table). At the moment, the remaining MatFields are divided into two piles – unclassified ("no keys found") and one or more non-specific keys found ("multiple").

Table 3: Summary of matching results

type of match

types

tokens

token pct. (coverage)

no keys found

11631

9.0

all matched

24694

19.1

exact

79177

61.4

multiple

13461

10.4

total

128963

100.0

tms keys

359

0.3

safe words

254

0.2

Table 4 shows the "head" of the frequency distribution of case-folded MatVal values along with the results of the analysis. You can see that most of them exactly match a ThesaurusMatVal. Note also that of the 647,899 objects, 519,936 (= 80.0%) have no material specified (i.e. the MatVal value is null in the database). This leaves 128,963 values requiring classification. Considering all the non-null MatVal values, 103,871 (= 79,177+24,694 = 80.5%) match one or more TMSMatVals (i.e. "all matched + exact").

$ perl pahmaMatchMaterials.pl pahma.cspace.berkeley.edu.conceptauthorities.660a70af-001a-49c2-96f2.items safe.csv materials.fix.unique.20120614.csv > materials.match.20120614.csv

$ wc materials.*2012*.csv

  649079 1855321 13604620 materials.fix.unique.20120614.csv
  173127 1234493 22926860 materials.match.20120614.csv

$ cut -f1 materials.match.20120614.csv | sort -u | wc -l
  128963

Table 4: Most frequent matching Materials Authority against object material descriptions

$ $ cut -f2,4,6 materials.match.20120614.csv | sort | uniq -c | sort -rn | head -30 | perl -ne 'chomp;s/^ *(\d+) /\1\t/;s/\t/ | /g;print "| $_ |\n"'

Count

MatField

Term name matched, if any

match type

44689

Obsidian

Obsidian

exact

13824

Ceramic

Ceramic

exact

13245

Film & glass plate

Glass plate

all matched

13245

Film & glass plate

Film

all matched

3862

ceramic

Ceramic

exact

3179

film

Film

exact

2780

obsidian

Obsidian

exact

2412

Wax cylinder

 

no keys found

1337

Wood

Wood

exact

1234

Shell: Spondylus pictorum

Spondylus pictorum

all matched

1234

Shell: Spondylus pictorum

Shell

all matched

1229

Textile

 

no keys found

933

Stone, Alabaster

Alabaster

multiple

933

Stone, Alabaster

 

multiple

845

Stone

 

no keys found

843

Stone, Flint

Flint

multiple

843

Stone, Flint

 

multiple

778

Shell: Conus Fergusoni

Shell

all matched

778

Shell: Conus Fergusoni

Conus fergusoni

all matched

650

AE

 

no keys found

627

Chert

Chert

exact

609

Flint

Flint

exact

560

Ceramic: Terracotta

Terracotta

all matched

560

Ceramic: Terracotta

Ceramic

all matched

480

wood

Wood

exact

442

Metal

Metal

exact

437

wool

Wool

exact

420

Faience

Faience

exact

414

Shell

Shell

exact

Table 5: Selected output rows in dev.cspace.berkeley.edu:/home/developers/csv/materials.match.csv

ObjectID

Free text value

refName

displayName

matchType

420

Human thigh bone, skin, silver, copper, cord

urn: ... :name(material_ca):item:name(mat1551463)'Bone'

Bone

multiple

420

Human thigh bone, skin, silver, copper, cord

urn: ... :name(material_ca):item:name(mat1558479)'Skin'

Skin

multiple

420

Human thigh bone, skin, silver, copper, cord

urn: ... :name(material_ca):item:name(mat1552743)'Silver'

Silver

multiple

420

Human thigh bone, skin, silver, copper, cord

urn: ... :name(material_ca):item:name(mat1558345)'Copper'

Copper

multiple

420

Human thigh bone, skin, silver, copper, cord

 

Human thigh bone, skin, silver, copper, cord

multiple

 

 

 

424

steel, leather, wood

urn: ... :name(material_ca):item:name(mat1551092)'Steel'

Steel

all matched

424

steel, leather, wood

urn: ... :name(material_ca):item:name(mat1558324)'Leather'

Leather

all matched

424

steel, leather, wood

urn: ... :name(material_ca):item:name(mat1558086)'Wood'

Wood

all matched

 

 

 

428

bronze

urn: ... :name(material_ca):item:name(mat1552753)'Bronze'

Bronze

exact

 

 

 

430

leater, felt, embroidery

 

leater, felt, embroidery

no keys found

Notes

  • One would expect that MatFields like "Stone, Flint" should match only once, but the string "Stone" matches as well as "Flint". A slight enhancement to the program would get this right, result in a few percentage point improvement in recall.

How a "multiple" record will appear in the Materials section of the UI

A Single Materials Authority item

<list-item>
<csid>494c79ee-73fd-4967-bb84-dc904719ad01</csid>
<uri>
/conceptauthorities/660a70af-001a-49c2-96f2/items/494c79ee-73fd-4967-bb84-dc904719ad01
</uri>
<updatedAt>2012-06-07T02:31:19Z</updatedAt>
<workflowState>project</workflowState>
<termDisplayName>Anhydrite</termDisplayName>
<shortIdentifier>mat1564807</shortIdentifier>
<refName>
urn:cspace:pahma.cspace.berkeley.edu:conceptauthorities:name(material_ca):item:name(mat1564807)'Anhydrite'
</refName>
<termStatus>accepted</termStatus>
</list-item>