PAHMA data cleaning -- Materials
Overview
The objects in the TMS database often have a free-text description of the materials and techniques used in their construction. This data is stored in the Medium field of the Objects table. For clarity, this value will be called a "MatVal". See Table 1 below.
The PAHMA Materials Authority has 331 values (hereinafter ThesaurusMatVals) each of which has a refName. Table 2 shows a few items in this authority.
The challenge is to match the existing MatVals of Objects in PAHMA-TMS to this authority, assign one or more refNames to each ObjectID.
Table 1: Some selected MatVal values for TMS objects
TMS query:
select ObjectID, ObjectName, Medium from Objects;
Count |
Value |
---|---|
560 |
Ceramic: Terracotta |
10 |
Ceramic: White ware |
1 |
Ceramic: YW |
8 |
Ceramic: Yellow Terracotta |
1 |
Ceramic: black |
28 |
Ceramic: black-top red ware |
35 |
Ceramic: black-top red ware, Polish |
1 |
Ceramic: black-top red ware?, Polish |
2 |
Ceramic: burnt |
2 |
Ceramic: glazed porcelain |
... |
|
2 |
Ceramic; Repaired |
11 |
Ceramic; Shell |
15 |
Ceramic; Shell;Textile |
4 |
Ceramic; Stone |
4 |
Ceramic; Stone; Wood; Shell; Metal: Copper |
... |
|
1 |
Clay (half baked) |
1 |
Clay, Charcoal, Tobacco |
1 |
Clay, cotton fabric, plastic “leather”, wood, paint, yarn |
2 |
Clay, glazed |
2 |
Clay, glazed, painted |
1 |
Clay, half-fired |
1 |
Clay, paint, horse hair, fabric, beacs |
1 |
Clay, paint, maple bark, beads |
1 |
Clay, probably hand-modelled |
1 |
Clay, wood, cactus, paint, feathers, buckskin, yarn |
1 |
Clay, wood, wool, fabric, simulated leather, paint |
Table 2: Selected rows from the Materials Authority index
(NB: these are from the dev.cspace materials authority, for demonstration only)
CSID |
displayName |
refName |
---|---|---|
8758f0cd-bdea-411a-b48c-652b0331b1eb |
Baleen |
urn:...:name(material_ca):item:name(mat1564794)'Baleen' |
2b405dc3-735f-4bf8-99a0-fda6e18bb751 |
Balsa |
urn:...:name(material_ca):item:name(mat1564792)'Balsa' |
a5233fc6-559c-4a8c-bd03-f8847bf5ea69 |
Bamboo |
urn:...:name(material_ca):item:name(mat1564764)'Bamboo' |
02d8d815-8b81-4bc3-9dff-14aab9a5109f |
Banana |
urn:...:name(material_ca):item:name(mat1564796)'Banana' |
4af9bd80-551b-4ba6-a603-eb8fec034fb7 |
Banana fiber |
urn:...:name(material_ca):item:name(mat1564795)'Banana fiber' |
5cc23005-b7c9-44e4-88e4-482cd537e362 |
Banana leaf |
urn:...:name(material_ca):item:name(mat1564797)'Banana leaf' |
Matching Algorithm
A two-step method is used:
- If the string value of the MatField matches the string value of a ThesaurusMatVal, we are done. (NB: the match is case-insensitive; there are no identified cases where case-folding causes a problem.)
- Since many MatFields specify more than one material, or add qualifications to the material specification, it is necessary to parse the non-exact-matching fields into chunks and attempt to match chunks and parts of chunks.
The first step is computationally trivial: a string match of case-folded values. However, looking at the unmatched residue, a significant number (almost 15%) could be classified by adding the following 24 values to the TMS Thesaurus. (NB: the ThesaurusMatKey values suggested are purely faculative; the actual values should be looked up)
Material or Medium |
Taxonomy Key |
---|---|
wax cylinder |
AUT.audio |
faience |
AUT.faience |
feathers |
AUT.feathers |
film & glass plate |
AUT.film |
film |
AUT.film |
16mm |
AUT.film |
16mm work copy |
AUT.film |
16mm original |
AUT.film |
35 mm slide |
AUT.film |
16mm original outs |
AUT.film |
video |
AUT.film |
super 8 |
AUT.film |
nut |
AUT.nut |
papyrus |
AUT.papyrus |
plate |
AUT.photograph |
albumen silver print |
AUT.photograph |
plate |
AUT.photograph |
shells |
AUT.shell |
ash, volcanic |
AUT.stone |
plaster |
AUT.stone |
fiber |
AUT.textile |
cane |
AUT.wood |
embrodiery |
AUT.technique |
lace |
AUT.technique |
The second slightly more complicated step entails taking apart the unmatched values and trying to match them (with near-perfect precison) to the taxonomy. We are willing to sacrifice some recall (i.e. leave some values unclassified) in order to ensure that the results that do come through are accurate. The parade examples of problem values are "gold, by Paul Smith" (which should be classified as "Metal: Gold"), and "Frank Ivory" (which should not be classified at all, since it represents the name of a person).
It remains to deal with the (51,868) cases where an exact match was not found, e.g. "film & glass plate", "ceramic: terracotta" or "stone, alabaster"). These MatVals were tokenized ("split") on a set of delimiters and the tokens were looked up in the TMS taxonomy. When more than one match was found, the most specific TMSMatKey was used. Several subcases exist: when there are 2 keys, and the second is more specific (i.e. one of the descendents of the first), the second key is kept ("2nd specific" in the table). At the moment, the remaining MatFields are divided into two piles – unclassified ("no keys found") and one or more non-specific keys found ("multiple").
Table 3: Summary of matching results
type of match |
types |
tokens |
token pct. (coverage) |
---|---|---|---|
no keys found |
11631 |
9.0 |
|
all matched |
24694 |
19.1 |
|
exact |
79177 |
61.4 |
|
multiple |
13461 |
10.4 |
|
total |
128963 |
100.0 |
tms keys |
359 |
0.3 |
safe words |
254 |
0.2 |
Table 4 shows the "head" of the frequency distribution of case-folded MatVal values along with the results of the analysis. You can see that most of them exactly match a ThesaurusMatVal. Note also that of the 647,899 objects, 519,936 (= 80.0%) have no material specified (i.e. the MatVal value is null in the database). This leaves 128,963 values requiring classification. Considering all the non-null MatVal values, 103,871 (= 79,177+24,694 = 80.5%) match one or more TMSMatVals (i.e. "all matched + exact").
$ perl pahmaMatchMaterials.pl pahma.cspace.berkeley.edu.conceptauthorities.660a70af-001a-49c2-96f2.items safe.csv materials.fix.unique.20120614.csv > materials.match.20120614.csv $ wc materials.*2012*.csv 649079 1855321 13604620 materials.fix.unique.20120614.csv 173127 1234493 22926860 materials.match.20120614.csv $ cut -f1 materials.match.20120614.csv | sort -u | wc -l 128963
Table 4: Most frequent matching Materials Authority against object material descriptions
$ $ cut -f2,4,6 materials.match.20120614.csv | sort | uniq -c | sort -rn | head -30 | perl -ne 'chomp;s/^ *(\d+) /\1\t/;s/\t/ | /g;print "| $_ |\n"'
Count |
MatField |
Term name matched, if any |
match type |
---|---|---|---|
44689 |
Obsidian |
Obsidian |
exact |
13824 |
Ceramic |
Ceramic |
exact |
13245 |
Film & glass plate |
Glass plate |
all matched |
13245 |
Film & glass plate |
Film |
all matched |
3862 |
ceramic |
Ceramic |
exact |
3179 |
film |
Film |
exact |
2780 |
obsidian |
Obsidian |
exact |
2412 |
Wax cylinder |
|
no keys found |
1337 |
Wood |
Wood |
exact |
1234 |
Shell: Spondylus pictorum |
Spondylus pictorum |
all matched |
1234 |
Shell: Spondylus pictorum |
Shell |
all matched |
1229 |
Textile |
|
no keys found |
933 |
Stone, Alabaster |
Alabaster |
multiple |
933 |
Stone, Alabaster |
|
multiple |
845 |
Stone |
|
no keys found |
843 |
Stone, Flint |
Flint |
multiple |
843 |
Stone, Flint |
|
multiple |
778 |
Shell: Conus Fergusoni |
Shell |
all matched |
778 |
Shell: Conus Fergusoni |
Conus fergusoni |
all matched |
650 |
AE |
|
no keys found |
627 |
Chert |
Chert |
exact |
609 |
Flint |
Flint |
exact |
560 |
Ceramic: Terracotta |
Terracotta |
all matched |
560 |
Ceramic: Terracotta |
Ceramic |
all matched |
480 |
wood |
Wood |
exact |
442 |
Metal |
Metal |
exact |
437 |
wool |
Wool |
exact |
420 |
Faience |
Faience |
exact |
414 |
Shell |
Shell |
exact |
Table 5: Selected output rows in dev.cspace.berkeley.edu:/home/developers/csv/materials.match.csv
ObjectID |
Free text value |
refName |
displayName |
matchType |
---|---|---|---|---|
420 |
Human thigh bone, skin, silver, copper, cord |
urn: ... :name(material_ca):item:name(mat1551463)'Bone' |
Bone |
multiple |
420 |
Human thigh bone, skin, silver, copper, cord |
urn: ... :name(material_ca):item:name(mat1558479)'Skin' |
Skin |
multiple |
420 |
Human thigh bone, skin, silver, copper, cord |
urn: ... :name(material_ca):item:name(mat1552743)'Silver' |
Silver |
multiple |
420 |
Human thigh bone, skin, silver, copper, cord |
urn: ... :name(material_ca):item:name(mat1558345)'Copper' |
Copper |
multiple |
420 |
Human thigh bone, skin, silver, copper, cord |
|
Human thigh bone, skin, silver, copper, cord |
multiple |
|
|
|
||
424 |
steel, leather, wood |
urn: ... :name(material_ca):item:name(mat1551092)'Steel' |
Steel |
all matched |
424 |
steel, leather, wood |
urn: ... :name(material_ca):item:name(mat1558324)'Leather' |
Leather |
all matched |
424 |
steel, leather, wood |
urn: ... :name(material_ca):item:name(mat1558086)'Wood' |
Wood |
all matched |
|
|
|
||
428 |
bronze |
urn: ... :name(material_ca):item:name(mat1552753)'Bronze' |
Bronze |
exact |
|
|
|
||
430 |
leater, felt, embroidery |
|
leater, felt, embroidery |
no keys found |
Notes
- One would expect that MatFields like "Stone, Flint" should match only once, but the string "Stone" matches as well as "Flint". A slight enhancement to the program would get this right, result in a few percentage point improvement in recall.
How a "multiple" record will appear in the Materials section of the UI
A Single Materials Authority item
<list-item> <csid>494c79ee-73fd-4967-bb84-dc904719ad01</csid> <uri> /conceptauthorities/660a70af-001a-49c2-96f2/items/494c79ee-73fd-4967-bb84-dc904719ad01 </uri> <updatedAt>2012-06-07T02:31:19Z</updatedAt> <workflowState>project</workflowState> <termDisplayName>Anhydrite</termDisplayName> <shortIdentifier>mat1564807</shortIdentifier> <refName> urn:cspace:pahma.cspace.berkeley.edu:conceptauthorities:name(material_ca):item:name(mat1564807)'Anhydrite' </refName> <termStatus>accepted</termStatus> </list-item>