Sorting and Searching PAHMA Object Numbers
Analysis of the data in TMS and the TMS approach to sorting
PAHMA Object Numbers (also known as "Museum Numbers" in the PAHMA UI and elsewhere) are unique string identifiers, one per object or subobject.
Most of the identifiers (88.9%) follow a simple pattern: a short sequence of digits, a hyphen, and another string of digits, e.g. "1-123456". And 94.1% follow a slight variation: this pattern plus an "excresence" in the form of a few letters, e.g. "1-23456a,b". The remaining 5.9% (about 48,500 objects) fall into one of about 250 more varied patterns. The most frequent patterns are show in Table 1 below (up to a cumulative frequency of 99%).
- "A" below means "a sequence of capital letters"
- "a" below means "a sequence of lowercase letters"
- "9" below means "a sequence of digits"
Table 1: Frequency Distribution of ObjectNumber ("Museum Number") Patterns in TMS
n=647,902
Pattern |
Frequency |
Relative F. |
Cumulative F. |
Example |
---|---|---|---|---|
9-9 |
575744 |
88.86% |
88.86% |
1-2345 |
A-9 |
22974 |
3.55% |
92.41% |
TB-192 |
9-9a |
7857 |
1.21% |
93.62% |
1-45678b |
9-9a,a |
7085 |
1.09% |
94.72% |
1-47899a,b |
9-9_a |
6441 |
0.99% |
95.71% |
|
Aa.9 |
4705 |
0.73% |
96.44% |
Box.99 |
9.9-9 |
3776 |
0.58% |
97.02% |
|
9.9 |
3638 |
0.56% |
97.58% |
|
9-9a-a |
3450 |
0.53% |
98.11% |
|
9-9.A#9.A |
1753 |
0.27% |
98.38% |
|
A-A9 |
1723 |
0.27% |
98.65% |
|
9-9.A#9.A.9 |
892 |
0.14% |
98.79% |
|
Aa.9A |
886 |
0.14% |
98.92% |
|
A-9a |
801 |
0.12% |
99.05% |
New and Proposed Museum Number types
Going forward, in CSpace, PAHMA will begin to assign numbers using the "Number Patterns" provided in the UI. Some of these patterns do occur, infrequently, in the existing TMS database. Table 2 shows the various prefixes and numbering patterns that are to be used.
Table 2: Museum Number Patterns to be available in CSpace
Pattern |
Description |
Remarks |
---|---|---|
HMA 2012.1.2 |
Cataloging |
|
LI 2012.1.2 |
In Loan |
|
LO 2012.1.2 |
Out Loan |
|
ACC 2012.1.2 |
Accession |
|
Inventory/Movement |
||
EX 2012.1.2 |
Object Exit |
|
IN 2012.1.2 |
Object Entry |
|
CL 2012.1.2 |
NAGPRA Claim |
|
Group |
|
|
Media |
|
|
Patterns for some of the |
||
CR 2012.1.2 |
Condition Checking |
|
INS 2012.1.2 |
Insurance |
|
VAL 2012.1.2 |
Valuation |
|
EXH 2012.1.2 |
Exhibition |
|
USE 2012.1.2 |
Use of Collections |
|
TR 2012.1.2 |
Treatment |
|
A few points about how these should be represented and collated:
- Some of the patterns above are not going to be used as Museum Numbers; there is no reason therefore that they should collate like them. However, they will need to collate amongst themselves, and so will eventually deserve similiar (and hopefully identical) treatment.
- A space between the alphabetic prefix and the number is desired, but the sortkey algorithm should accommodate numbers without a space.
- The key of these types of numbers should sort in 4 levels, the first alphabetic, the next 3 numeric.
- While the prefixes could be enumerated (they will be in the Number Patterns UI widget), it would be nice if the algorithm could handle new prefixes without requiring the code to be updated.
TMS Sort Key Generation
TMS employs an algorithm which generates a sort key from the input ObjectNumber value. This value is stored in the SortNumber column of the Objects table and used for collation and perhaps other purposes. The value is configurable by the user in consultation with TMS staff. The TMS manual contains the following configuration information:
Field |
Example |
Description |
---|---|---|
SortNumber.PartOrder |
1.2.3 |
Specifies the order in which the different parts of the SortNumber Template should be joined. |
SortNumber.Template |
XXXXXX.XXXXXX.X XXXX.XXXXX.XXXX .XXX.XXX.XXXXX.X XXXX.XXXXX.XXXX X.XXXX |
The template for object numbers (DO NOT CHANGE this setting without consulting with Gallery Systems support staff; see “Updating SortNumber Templates” on page 26-26) |
TMS Users' Manual 26-14
(Note that this same template and instruction is used in TMS in a variety of places besides ObjectNumber.)
Options for PAHMA structuredObjectNumber in CSpace
Two options for migrating and maintaining a sortable value for objectNumbers in CSpace:
- Migrate the existing values of the SortNumber field to CSpace and reverse engineer the TMS method for creating them in CSpace. There is little value and some risk in doing this: since new ObjectNumbers will continue to be created, whatever algorithm (and implementation) is devised for CSpace it can be employed as part of the migration process to create new sortNumbers. Note also that the TMS implementation appears to have some problems and not all the values in the existing SortNumber field seem to be appropriate. Furthermore, implementation details for the TMS procedure are not included in the documentation, and may be hard to come by; at any rate, there is a risk that the re-implementation might not match the current implementation in some significant detail.
- Devise a new algorithm and implementation, hopefully relatively general and straightforward, that can be used to both re-generate sort keys as part of migration and to create new sort keys in the future.
The proposal for the second option is presented in more detail below.
Recommendations for PAHMA structuredObjectNumber in CSpace
It is desired to convert the objectNumber value into a scalar value which supports collation and range searching. This could result in a numeric or hash value being produced, but having a human-readable key that bears a structural resemblance to the original has some advantages.
As noted above, almost 95% of the existing objectNumbers in TMS match a relatively simple regex, and that this expression can be modified to handle the new _number patterns_. The remaining identifiers are pretty much all over the map, but could also be collated using a relatively simply algorithm (see below), and one which might be robust enough to serve as a general template for creating such sort keys in CSpace.
The main pattern has up to 5 parts:
PREFIX NUMBER NUMBER NUMBER ETC
Therefore, a two part algorithm is suggested:
- If the identifier matches the "main" 5-part pattern:
we are nearly done: Ignore the first match group (the optional "component prefix" C. and its variants) and concatenate the five captured tokens (excluding punctuation) with blank as separator (padding to length 6 with zeroes if the token is numeric). If the first element is alphabetic, include it, otherwise omit it.
/ /^([cC](ons|ONS)?[\-\. ]?)?([A-Z]+)?([\-\. ])?(\d+)([\-\. ])?(\d+)([\.\- ]+)?(\d+)?([\.\- ]+)?(.*)$/
Note that this suggestion has the nice property that the original value is easily retrieved from this sort key, which it closely resembles.
Another fine point to note is that some "non-standard" input strings match the pattern (esp. cases which have an alphabetic first element and a numeric second element) and these get reformatted. This should be regarded as a feature not a bug since in the vast majority of such cases the result is an improved collation of these values, and it is unlikely that they will incorrectly collate with standard values. However, removing the \w from the third captured element (first output element) would element most of these from reformatting. - For the remaining 5% that do not match this pattern, three options are possible. For PAHMA, it appears that the "pathological cases" below do not need to be handled, so "leave them as is" is fine as the ELSEWHERE case for objectNumbers that do not match the main pattern.!
Javascript snippet implementing algorithm suggested above (includes test cases)
pad = function(s, len){ if (len-s.length+1 > 0) { return (new Array((len-s.length+1) ).join("0")) + s; } else { return s; } }; isNumber = function(n) { return !isNaN(parseFloat(n)) && isFinite(n); } function testValues() { var resultString = "<table>"; var objnums = { 'Do Not Touch!':0, 'xxx,xxx 99':0, 'HMA 2012.1.20 b':4, '2012.10.20 b':4, '2012.1':4, 'HMA 2012.01.20 b':4, 'LI 2012.11.200':4, 'LO 2012.1.20':4, 'LO 2012.1.200':4, 'LO2012.1.19':4, 'LO 2012.1.20a':4, 'ACC 2012.1.2':4, 'ACC 2012.1.02x':4, 'c-1-210':4,'C.99-1234-a':6, 'C99-1234.9':6,'cons99-1234.92':6, 'C 99-1234.93':6,'cons 99-1234.94':6, 'CONS.99-1234-a99':6, 'Cons.1-10011':1,'Cons1-10011':1, 'IJ 1':1, 'IJ-1':1, 'K-1993m':1, 'L 54.739.2':1, 'L-16576e,f,h,i':1, '1-2':1,'1-200a':3, '1.3 a,b':2,'99-12345678912345 xx':5} for (objnum in objnums) { // console.log(objnum,createStructuredObjectNumber(objnum)); resultString += "<tr><td>" + objnum + "<td>" + createStructuredObjectNumber(objnum); } resultString += "</table>"; document.getElementById('results').innerHTML = '<pre>' + resultString + '</pre>'; } function createStructuredObjectNumber(objnum) { // 1 2 3 4 5 6 7 8 9 10 11 var objRe = /^([cC](ons|ONS)?[\-\. ]?)?([A-Z]+)?([\-\. ])?(\d+)([\-\. ])?(\d+)?([\.\- ]+)?(\d+)?([\.\- ]+)?(.*)$/; var objTokens = objRe.exec(objnum); if (objTokens == null) { return objnum; } else { for ( i = 0 ; i < objTokens.length ; i = i+1 ) { if (!objTokens[i]) { objTokens[i] = ''; } else { if (isNumber(objTokens[i])) { objTokens[i] = pad(objTokens[i],6); } } objTokens[i] = objTokens[i] + ' '; } if (objTokens[3] == ' ') objTokens[3] == ''; // zap empty alphabetic prefix return objTokens[3]+objTokens[5]+objTokens[7]+objTokens[9]+objTokens[11]; } }
Some "Ordinary" Cases
(most of these match the "main pattern", though in some cases the resulting key is not strictly collatable)
ORIGINAL |
PROPOSED KEY |
---|---|
Do Not Touch! |
Do Not Touch! |
xxx,xxx 99 |
xxx,xxx 99 |
HMA 2012.1.20 b |
HMA 002012 000001 000020 b |
2012.10.20 b |
002012 000010 000020 b |
2012.1 |
002012 000001 |
HMA 2012.01.20 b |
HMA 002012 000001 000020 b |
LI 2012.11.200 |
LI 002012 000011 000200 |
LO 2012.1.20 |
LO 002012 000001 000020 |
LO 2012.1.200 |
LO 002012 000001 000200 |
LO2012.1.19 |
LO 002012 000001 000019 |
LO 2012.1.20a |
LO 002012 000001 000020 a |
ACC 2012.1.2 |
ACC 002012 000001 000002 |
ACC 2012.1.02x |
ACC 002012 000001 000002 x |
c-1-210 |
000001 000210 |
C.99-1234-a |
000099 001234 a |
C99-1234.9 |
000099 001234 000009 |
cons99-1234.92 |
000099 001234 000092 |
C 99-1234.93 |
000099 001234 000093 |
cons 99-1234.94 |
000099 001234 000094 |
CONS.99-1234-a99 |
000099 001234 a99 |
Cons.1-10011 |
000001 010011 |
Cons1-10011 |
000001 010011 |
IJ 1 |
IJ 000001 |
IJ-1 |
IJ 000001 |
K-1993m |
K 001993 m |
L 54.739.2 |
L 000054 000739 000002 |
L-16576e,f,h,i |
L 016576 e,f,h,i |
1-2 |
000001 000002 |
1-200a |
000001 000200 a |
1.3 a,b |
000001 000003 a,b |
99-12345678912345 xx |
000099 12345678912345 xx |
More "Pathological" Cases
NB: the following sections are superceded, but is retained here to remind us to check these pathological cases at some point.
A more general algorithm, that can collate mixed strings of number, letters, and punctuation pretty well is the following, preserved here purely for speculative interest:
- perform a two-step tokenization and reassembly of the parts as follows:
- insert whitespace between each juncture of letter and digit:
s/([a-zA-Z]+)([0-9]+)/\1 \2/g; s/([0-9]+)([a-zA-Z]+)/\1 \2/g;
- split the identifier into tokens on white space and selected puncuation, i.e.
split("[\s\_\-\.\;\:\,\-\+\/\#]+")
- reassemble the tokens, with padding to fixed lengths, see script below.
- insert whitespace between each juncture of letter and digit:
These are handled by the "elsewhere" case in the script below. MTB says that most of these are ephemeral, apocryphal, or otherwise moot (i.e. they are accession numbers or "virtual objects" that do not refer to real objects, despite being in the ObjectNumber column in Objects). Nevertheless, we would be wise to spend just a little while considering identifiers like these.
ORIGINAL PATTERN PROPOSED KEY TMS SORTNUM IJ 1 IJ-000001 IJ-1 IJ-000001 K-1993m K-001993^m L 54.739.2 L-000054^739.2 L-16576e,f,h,i L-016576^e,f,h,i 100.1-145272 9.9-9 0100-000001^145272 ^^^100^^^^^114527^^^^2 100.1-145274 9.9-9 0100-000001^145274 ^^^100^^^^^114527^^^^4 100.1-145275 9.9-9 0100-000001^145275 ^^^100^^^^^114527^^^^5 1-154197.dup 9-9.a 0001-154197^dup ^^^^^1154197dup 1-205632_dupe 9-9_a 0001-205632^dupe ^^^^^1205632dup 1-217004_dupe 9-9_a 0001-217004^dupe ^^^^^1217004 1-217005_dupe 9-9_a 0001-217005^dupe ^^^^^1217005 1-217006_dupe 9-9_a 0001-217006^dupe ^^^^^1217006 14-1 9-9 0014-000001^ ^^^^14^^^^^1 14-10.1 9-9.9 0014-000010^1 ^^^^14^^^^10^^^^1 14-10.1.dup.1 9-9.9.a.9 0014-000010^1.dup.1 ^^^^14^^^^10^^^^1dup^^^^^1 14-10.1.dup.2 9-9.9.a.9 0014-000010^1.dup.2 ^^^^14^^^^10^^^^1dup^^^^^2 14-100 9-9 0014-000100^ ^^^^14^^^100 14-1000a.1 9-9a.9 0014-001000^a.1 ^^^^14^^1000a^^^^^^^^1 14-1000a.1.dup.1 9-9a.9.a.9 0014-001000^a.1.dup.1 ^^^^14^^1000a^^^^^^^^1dup^^^1 14-1000b.2 9-9a.9 0014-001000^b.2 ^^^^14^^1000b^^^^^^^^2 14-1000b.2.dup.1 9-9a.9.a.9 0014-001000^b.2.dup.1 ^^^^14^^1000b^^^^^^^^2dup^^^1 14-1000c.3 9-9a.9 0014-001000^c.3 ^^^^14^^1000c^^^^^^^^3 1-TEMP01a 9-A9a ^^^1TEMP^^01^^^a ^^^^^1TEMP^^^^^^1a 1-TEMP01b 9-A9a ^^^1TEMP^^01^^^b ^^^^^1TEMP^^^^^^1b 1-TEMP02 9-A9 ^^^1TEMP^^02 ^^^^^1TEMP^^^^^^2 1-TEMP03 9-A9 ^^^1TEMP^^03 ^^^^^1TEMP^^^^^^3 1168.Crem 1 9.Aa 9 1168Crem^^^1 ^^1168Crem^^^^^^1 1175.Crem 5 9.Aa 9 1175Crem^^^5 ^^1175Crem^^^^^^5 24-2382.S#12.A and B 9-9.A#9.A a A 0024-002382^S#12.A^and^B ^^^^24^^2382S^^^^^^^12A^^^andB 4-"Blank" Box (1 of 3) Broughton 9-"Aa" Aa (9 a 9) Aa ^^^4Blank^Box^^^1^^of^^^3Broughton x_FaunalBox^^^^^^^1^broughton 4-"Blank" Box (2 of 3) 9-"Aa" Aa (9 a 9) ^^^4Blank^Box^^^2^^of^^^3 x_FaunalBox^^^^^^^2^broughton Box #1 (Sac #1?) (2 of 2) Aa #9 (Aa #9?) (9 a 9) ^Box^^^1^Sac^^^1^^^2^^of^^^2 x_FaunalBox^^^^^^^1^2 Box #2 (Sac #2?) (1 of 2) Orig. Box 5 Aa #9 (Aa #9?) (9 a 9) Aa. Aa 9 ^Box^^^2^Sac^^^2^^^1^^of^^^2Orig^Box^^^5 x_FaunalBox^^^^^^^0 Box (1-no #-1 of 2) Aa (9-a #-9 a 9) ^Box^^^1^^no^^^1^^of^^^2 x_FaunalBox^^^^^^^1 Box (1-no #-2 of 2) Broughton M-1 Aa (9-a #-9 a 9) Aa A-9 ^Box^^^1^^no^^^2^^of^^^2Broughton^^^M^^^1 x_FaunalBox^^^^^^^2 Box (11-no#-1 of 2) Aa (9-a#-9 a 9) ^Box^^11^^no^^^1^^of^^^2 x_FaunalBox^^^^^^^1 Box (11-no#-2 of 2) Aa (9-a#-9 a 9) ^Box^^11^^no^^^2^^of^^^2 x_FaunalBox^^^^^^^2 1-___ CA-Mrn-193 duplicates 9-___ A-Aa-9 a ^^^1^___^^CA^Mrn^193duplicates ^^^^^1CA^^^^Mrn^^^^193duplicates 1-___ CA-SFr-2500 Mason St. duplicate nums 9-___ A-Aa-9 Aa Aa. a a ^^^1^___^^CA^SFr2500Mason^^Stduplicatenums ^^^^^1CA^^^^SFr^^^2500Mason^^St^duplicate^nums
Hasty perl script to aid in prototyping structuredObjectNumber algorithm
execute as "perl -n sortable.pl testcases.csv". testcases.csv and this script are attached.
This script is retained for posterity, but bears only a modest resemblance to what is implemented in
the javascript above and what is defined in the algorithm.
# parse input line, clean up a bit chomp; s/ +$//; s/ +\t/\t/; my ($original,$sortNum) = split /\t/; $_ = $original; # make pattern s/\d+/9/g; s/[A-Z]+/A/g; s/[a-z]+/a/g; my $pattern = $_; # generate sortkey my $sortkey; my ($left,$right,$extra); ($left,$right,$extra) = $original =~ /^([\d\w]+)[\-\.](\d+)(.*)$/; # basic pattern $extra =~ s/^[\s\_\-\.\;\:\,\+\/\#]+//; # remove initial non-filing chars (i.e. punctuation) # NB: this may be problematic in some cases. if ($left && $right) { if ($left =~ /^[0-9]+$/) { $sortkey = sprintf "%04d-%06d %s", ($left+0,$right,$extra); } else { # handle case where leftmost token is alphabetic... $sortkey = sprintf "%-4s-%06d %s", ($left,$right,$extra); } } else { # fallback : see prose above my $string = $original; $string =~ s/([a-zA-Z]+)([0-9]+)/\1 \2/g; # handle both cases: letter first $string =~ s/([0-9]+)([a-zA-Z]+)/\1 \2/g; # digits first $string =~ s/\W/ /g; @tokens = split(/[\s\-\.\:\,\-\+\/\#]+/,$string); my @paddedTokens = map { sprintf "%4s",$_ } @tokens; $sortkey = join '',@paddedTokens; } $sortkey =~ s/ /\^/g; # add ^ to make blanks visible $sortNum =~ s/ /\^/g; # add ^ to make blanks visible printf "%s\t%s\t%s\t%s\n", ($original,$pattern,$sortkey,$sortNum) ;