Sorting and Searching PAHMA Object Numbers

Analysis of the data in TMS and the TMS approach to sorting

PAHMA Object Numbers (also known as "Museum Numbers" in the PAHMA UI and elsewhere) are unique string identifiers, one per object or subobject.

Most of the identifiers (88.9%) follow a simple pattern: a short sequence of digits, a hyphen, and another string of digits, e.g. "1-123456". And 94.1% follow a slight variation: this pattern plus an "excresence" in the form of a few letters, e.g. "1-23456a,b". The remaining 5.9% (about 48,500 objects) fall into one of about 250 more varied patterns. The most frequent patterns are show in Table 1 below (up to a cumulative frequency of 99%).

"A" below means "a sequence of capital letters"
"a" below means "a sequence of lowercase letters"
"9" below means "a sequence of digits"

Table 1: Frequency Distribution of ObjectNumber ("Museum Number") Patterns in TMS

n=647,902

Pattern	Frequency	Relative F.	Cumulative F.	Example
9-9	575744	88.86%	88.86%	1-2345
A-9	22974	3.55%	92.41%	TB-192
9-9a	7857	1.21%	93.62%	1-45678b
9-9a,a	7085	1.09%	94.72%	1-47899a,b
9-9_a	6441	0.99%	95.71%
Aa.9	4705	0.73%	96.44%	Box.99
9.9-9	3776	0.58%	97.02%
9.9	3638	0.56%	97.58%
9-9a-a	3450	0.53%	98.11%
9-9.A#9.A	1753	0.27%	98.38%
A-A9	1723	0.27%	98.65%
9-9.A#9.A.9	892	0.14%	98.79%
Aa.9A	886	0.14%	98.92%
A-9a	801	0.12%	99.05%

New and Proposed Museum Number types

Going forward, in CSpace, PAHMA will begin to assign numbers using the "Number Patterns" provided in the UI. Some of these patterns do occur, infrequently, in the existing TMS database. Table 2 shows the various prefixes and numbering patterns that are to be used.

Table 2: Museum Number Patterns to be available in CSpace

Pattern	Description	Remarks
HMA 2012.1.2	Cataloging
LI 2012.1.2	In Loan
LO 2012.1.2	Out Loan
ACC 2012.1.2	Accession
	Inventory/Movement
EX 2012.1.2	Object Exit
IN 2012.1.2	Object Entry
CL 2012.1.2	NAGPRA Claim
	Group
	Media
Patterns for some of the not-yet-implemented procedures
CR 2012.1.2	Condition Checking
INS 2012.1.2	Insurance
VAL 2012.1.2	Valuation
EXH 2012.1.2	Exhibition
USE 2012.1.2	Use of Collections
TR 2012.1.2	Treatment

A few points about how these should be represented and collated:

Some of the patterns above are not going to be used as Museum Numbers; there is no reason therefore that they should collate like them. However, they will need to collate amongst themselves, and so will eventually deserve similiar (and hopefully identical) treatment.
A space between the alphabetic prefix and the number is desired, but the sortkey algorithm should accommodate numbers without a space.
The key of these types of numbers should sort in 4 levels, the first alphabetic, the next 3 numeric.
While the prefixes could be enumerated (they will be in the Number Patterns UI widget), it would be nice if the algorithm could handle new prefixes without requiring the code to be updated.

TMS Sort Key Generation

TMS employs an algorithm which generates a sort key from the input ObjectNumber value. This value is stored in the SortNumber column of the Objects table and used for collation and perhaps other purposes. The value is configurable by the user in consultation with TMS staff. The TMS manual contains the following configuration information:

Field	Example	Description
SortNumber.PartOrder	1.2.3	Specifies the order in which the different parts of the SortNumber Template should be joined.
SortNumber.Template	XXXXXX.XXXXXX.X XXXX.XXXXX.XXXX .XXX.XXX.XXXXX.X XXXX.XXXXX.XXXX X.XXXX	The template for object numbers (DO NOT CHANGE this setting without consulting with Gallery Systems support staff; see “Updating SortNumber Templates” on page 26-26)

TMS Users' Manual 26-14

(Note that this same template and instruction is used in TMS in a variety of places besides ObjectNumber.)

Options for PAHMA structuredObjectNumber in CSpace

Two options for migrating and maintaining a sortable value for objectNumbers in CSpace:

Migrate the existing values of the SortNumber field to CSpace and reverse engineer the TMS method for creating them in CSpace. There is little value and some risk in doing this: since new ObjectNumbers will continue to be created, whatever algorithm (and implementation) is devised for CSpace it can be employed as part of the migration process to create new sortNumbers. Note also that the TMS implementation appears to have some problems and not all the values in the existing SortNumber field seem to be appropriate. Furthermore, implementation details for the TMS procedure are not included in the documentation, and may be hard to come by; at any rate, there is a risk that the re-implementation might not match the current implementation in some significant detail.
Devise a new algorithm and implementation, hopefully relatively general and straightforward, that can be used to both re-generate sort keys as part of migration and to create new sort keys in the future.

The proposal for the second option is presented in more detail below.

Recommendations for PAHMA structuredObjectNumber in CSpace

It is desired to convert the objectNumber value into a scalar value which supports collation and range searching. This could result in a numeric or hash value being produced, but having a human-readable key that bears a structural resemblance to the original has some advantages.

As noted above, almost 95% of the existing objectNumbers in TMS match a relatively simple regex, and that this expression can be modified to handle the new _number patterns_. The remaining identifiers are pretty much all over the map, but could also be collated using a relatively simply algorithm (see below), and one which might be robust enough to serve as a general template for creating such sort keys in CSpace.

The main pattern has up to 5 parts:

PREFIX NUMBER NUMBER NUMBER ETC

Therefore, a two part algorithm is suggested:

If the identifier matches the "main" 5-part pattern:
```
/ /^([cC](ons|ONS)?[\-\. ]?)?([A-Z]+)?([\-\. ])?(\d+)([\-\. ])?(\d+)([\.\- ]+)?(\d+)?([\.\- ]+)?(.*)$/
```
we are nearly done: Ignore the first match group (the optional "component prefix" C. and its variants) and concatenate the five captured tokens (excluding punctuation) with blank as separator (padding to length 6 with zeroes if the token is numeric). If the first element is alphabetic, include it, otherwise omit it.
Note that this suggestion has the nice property that the original value is easily retrieved from this sort key, which it closely resembles.
Another fine point to note is that some "non-standard" input strings match the pattern (esp. cases which have an alphabetic first element and a numeric second element) and these get reformatted. This should be regarded as a feature not a bug since in the vast majority of such cases the result is an improved collation of these values, and it is unlikely that they will incorrectly collate with standard values. However, removing the \w from the third captured element (first output element) would element most of these from reformatting.
For the remaining 5% that do not match this pattern, three options are possible. For PAHMA, it appears that the "pathological cases" below do not need to be handled, so "leave them as is" is fine as the ELSEWHERE case for objectNumbers that do not match the main pattern.!

Javascript snippet implementing algorithm suggested above (includes test cases)

pad = function(s, len){
  if (len-s.length+1 > 0) {
    return (new Array((len-s.length+1) ).join("0")) + s;
  }
  else {
    return s;
  }
};

isNumber = function(n) {
  return !isNaN(parseFloat(n)) && isFinite(n);
}

function testValues() {
  var resultString = "<table>";
  var objnums =
  { 'Do Not Touch!':0,
    'xxx,xxx 99':0, 
    'HMA 2012.1.20 b':4,
    '2012.10.20 b':4,
    '2012.1':4,
    'HMA 2012.01.20 b':4,
    'LI 2012.11.200':4,
    'LO 2012.1.20':4,
    'LO 2012.1.200':4,
    'LO2012.1.19':4,
    'LO 2012.1.20a':4,
    'ACC 2012.1.2':4,
    'ACC 2012.1.02x':4,
    'c-1-210':4,'C.99-1234-a':6,
    'C99-1234.9':6,'cons99-1234.92':6,
    'C 99-1234.93':6,'cons 99-1234.94':6,
    'CONS.99-1234-a99':6,
    'Cons.1-10011':1,'Cons1-10011':1,
    'IJ 1':1,
    'IJ-1':1,
    'K-1993m':1,
    'L 54.739.2':1,
    'L-16576e,f,h,i':1,
    '1-2':1,'1-200a':3,
    '1.3 a,b':2,'99-12345678912345 xx':5}
  for (objnum in objnums) {
    // console.log(objnum,createStructuredObjectNumber(objnum));
    resultString += "<tr><td>" + objnum + "<td>" + createStructuredObjectNumber(objnum);
    }
  resultString += "</table>";
  document.getElementById('results').innerHTML = '<pre>' + resultString + '</pre>';
}
function createStructuredObjectNumber(objnum) {
  //            1    2                   3        4         5    6         7    8          9     10         11   
  var objRe = /^([cC](ons|ONS)?[\-\. ]?)?([A-Z]+)?([\-\. ])?(\d+)([\-\. ])?(\d+)?([\.\- ]+)?(\d+)?([\.\- ]+)?(.*)$/;
  var objTokens = objRe.exec(objnum);
  if (objTokens == null) {
    return objnum;
  }
  else {
    for ( i = 0 ; i < objTokens.length ; i = i+1 ) {
      if (!objTokens[i]) {
        objTokens[i] = '';
      }
      else {
        if (isNumber(objTokens[i])) {
          objTokens[i] = pad(objTokens[i],6);
        }
      }
      objTokens[i] = objTokens[i] + ' ';
    }
    if (objTokens[3] == ' ') objTokens[3] == ''; // zap empty alphabetic prefix
    return objTokens[3]+objTokens[5]+objTokens[7]+objTokens[9]+objTokens[11];
  }
}

Some "Ordinary" Cases

(most of these match the "main pattern", though in some cases the resulting key is not strictly collatable)

ORIGINAL	PROPOSED KEY
Do Not Touch!	Do Not Touch!
xxx,xxx 99	xxx,xxx 99
HMA 2012.1.20 b	HMA 002012 000001 000020 b
2012.10.20 b	002012 000010 000020 b
2012.1	002012 000001
HMA 2012.01.20 b	HMA 002012 000001 000020 b
LI 2012.11.200	LI 002012 000011 000200
LO 2012.1.20	LO 002012 000001 000020
LO 2012.1.200	LO 002012 000001 000200
LO2012.1.19	LO 002012 000001 000019
LO 2012.1.20a	LO 002012 000001 000020 a
ACC 2012.1.2	ACC 002012 000001 000002
ACC 2012.1.02x	ACC 002012 000001 000002 x
c-1-210	000001 000210
C.99-1234-a	000099 001234 a
C99-1234.9	000099 001234 000009
cons99-1234.92	000099 001234 000092
C 99-1234.93	000099 001234 000093
cons 99-1234.94	000099 001234 000094
CONS.99-1234-a99	000099 001234 a99
Cons.1-10011	000001 010011
Cons1-10011	000001 010011
IJ 1	IJ 000001
IJ-1	IJ 000001
K-1993m	K 001993 m
L 54.739.2	L 000054 000739 000002
L-16576e,f,h,i	L 016576 e,f,h,i
1-2	000001 000002
1-200a	000001 000200 a
1.3 a,b	000001 000003 a,b
99-12345678912345 xx	000099 12345678912345 xx

More "Pathological" Cases

NB: the following sections are superceded, but is retained here to remind us to check these pathological cases at some point.

A more general algorithm, that can collate mixed strings of number, letters, and punctuation pretty well is the following, preserved here purely for speculative interest:

perform a two-step tokenization and reassembly of the parts as follows:
- insert whitespace between each juncture of letter and digit:
```
s/([a-zA-Z]+)([0-9]+)/\1 \2/g;
  s/([0-9]+)([a-zA-Z]+)/\1 \2/g;
```
- split the identifier into tokens on white space and selected puncuation, i.e.
```
split("[\s\_\-\.\;\:\,\-\+\/\#]+")
```
- reassemble the tokens, with padding to fixed lengths, see script below.

These are handled by the "elsewhere" case in the script below. MTB says that most of these are ephemeral, apocryphal, or otherwise moot (i.e. they are accession numbers or "virtual objects" that do not refer to real objects, despite being in the ObjectNumber column in Objects). Nevertheless, we would be wise to spend just a little while considering identifiers like these.

ORIGINAL              PATTERN           PROPOSED KEY            TMS SORTNUM


 IJ 1                                   IJ-000001
 IJ-1                                   IJ-000001
 K-1993m                                K-001993^m
 L 54.739.2                             L-000054^739.2
 L-16576e,f,h,i                         L-016576^e,f,h,i

100.1-145272        	9.9-9          	0100-000001^145272  	^^^100^^^^^114527^^^^2
100.1-145274        	9.9-9          	0100-000001^145274  	^^^100^^^^^114527^^^^4
100.1-145275        	9.9-9          	0100-000001^145275  	^^^100^^^^^114527^^^^5
1-154197.dup        	9-9.a          	0001-154197^dup     	^^^^^1154197dup
1-205632_dupe       	9-9_a          	0001-205632^dupe    	^^^^^1205632dup
1-217004_dupe       	9-9_a          	0001-217004^dupe    	^^^^^1217004
1-217005_dupe       	9-9_a          	0001-217005^dupe    	^^^^^1217005
1-217006_dupe       	9-9_a          	0001-217006^dupe    	^^^^^1217006
14-1                	9-9            	0014-000001^        	^^^^14^^^^^1
14-10.1             	9-9.9          	0014-000010^1       	^^^^14^^^^10^^^^1
14-10.1.dup.1       	9-9.9.a.9      	0014-000010^1.dup.1 	^^^^14^^^^10^^^^1dup^^^^^1
14-10.1.dup.2       	9-9.9.a.9      	0014-000010^1.dup.2 	^^^^14^^^^10^^^^1dup^^^^^2
14-100              	9-9            	0014-000100^        	^^^^14^^^100
14-1000a.1          	9-9a.9         	0014-001000^a.1     	^^^^14^^1000a^^^^^^^^1
14-1000a.1.dup.1    	9-9a.9.a.9     	0014-001000^a.1.dup.1	^^^^14^^1000a^^^^^^^^1dup^^^1
14-1000b.2          	9-9a.9         	0014-001000^b.2     	^^^^14^^1000b^^^^^^^^2
14-1000b.2.dup.1    	9-9a.9.a.9     	0014-001000^b.2.dup.1	^^^^14^^1000b^^^^^^^^2dup^^^1
14-1000c.3          	9-9a.9         	0014-001000^c.3     	^^^^14^^1000c^^^^^^^^3
1-TEMP01a           	9-A9a          	^^^1TEMP^^01^^^a    	^^^^^1TEMP^^^^^^1a
1-TEMP01b           	9-A9a          	^^^1TEMP^^01^^^b    	^^^^^1TEMP^^^^^^1b
1-TEMP02            	9-A9           	^^^1TEMP^^02        	^^^^^1TEMP^^^^^^2
1-TEMP03            	9-A9           	^^^1TEMP^^03        	^^^^^1TEMP^^^^^^3
1168.Crem 1         	9.Aa 9         	1168Crem^^^1        	^^1168Crem^^^^^^1
1175.Crem 5         	9.Aa 9         	1175Crem^^^5        	^^1175Crem^^^^^^5

24-2382.S#12.A and B	9-9.A#9.A a A  	0024-002382^S#12.A^and^B ^^^^24^^2382S^^^^^^^12A^^^andB
4-"Blank" Box (1 of 3) Broughton 9-"Aa" Aa (9 a 9) Aa ^^^4Blank^Box^^^1^^of^^^3Broughton x_FaunalBox^^^^^^^1^broughton
4-"Blank" Box (2 of 3)	  9-"Aa" Aa (9 a 9)             ^^^4Blank^Box^^^2^^of^^^3          x_FaunalBox^^^^^^^2^broughton
Box #1 (Sac #1?) (2 of 2) Aa #9 (Aa #9?) (9 a 9)	^Box^^^1^Sac^^^1^^^2^^of^^^2     x_FaunalBox^^^^^^^1^2
Box #2 (Sac #2?) (1 of 2) Orig. Box 5	Aa #9 (Aa #9?) (9 a 9) Aa. Aa 9	^Box^^^2^Sac^^^2^^^1^^of^^^2Orig^Box^^^5	x_FaunalBox^^^^^^^0
Box (1-no #-1 of 2) 	Aa (9-a #-9 a 9)	^Box^^^1^^no^^^1^^of^^^2	x_FaunalBox^^^^^^^1
Box (1-no #-2 of 2) Broughton M-1	Aa (9-a #-9 a 9) Aa A-9	^Box^^^1^^no^^^2^^of^^^2Broughton^^^M^^^1	x_FaunalBox^^^^^^^2
Box (11-no#-1 of 2) 	Aa (9-a#-9 a 9)	^Box^^11^^no^^^1^^of^^^2	x_FaunalBox^^^^^^^1
Box (11-no#-2 of 2) 	Aa (9-a#-9 a 9)	^Box^^11^^no^^^2^^of^^^2	x_FaunalBox^^^^^^^2
1-___ CA-Mrn-193 duplicates	9-___ A-Aa-9 a 	^^^1^___^^CA^Mrn^193duplicates	^^^^^1CA^^^^Mrn^^^^193duplicates
1-___ CA-SFr-2500 Mason St. duplicate nums	9-___ A-Aa-9 Aa Aa. a a	^^^1^___^^CA^SFr2500Mason^^Stduplicatenums	^^^^^1CA^^^^SFr^^^2500Mason^^St^duplicate^nums

Hasty perl script to aid in prototyping structuredObjectNumber algorithm

execute as "perl -n sortable.pl testcases.csv". testcases.csv and this script are attached.

This script is retained for posterity, but bears only a modest resemblance to what is implemented in
the javascript above and what is defined in the algorithm.

# parse input line, clean up a bit
chomp;
s/ +$//;
s/ +\t/\t/;
my ($original,$sortNum) = split /\t/;
$_ = $original;

# make pattern
s/\d+/9/g;
s/[A-Z]+/A/g;
s/[a-z]+/a/g;
my $pattern = $_;

# generate sortkey
my $sortkey;
my ($left,$right,$extra);
($left,$right,$extra) = $original =~ /^([\d\w]+)[\-\.](\d+)(.*)$/; # basic pattern
$extra =~ s/^[\s\_\-\.\;\:\,\+\/\#]+//; # remove initial non-filing chars (i.e. punctuation)
                                        # NB: this may be problematic in some cases.
if ($left && $right) {
  if ($left =~ /^[0-9]+$/) {
    $sortkey = sprintf "%04d-%06d %s", ($left+0,$right,$extra);
  }
  else { # handle case where leftmost token is alphabetic...
    $sortkey = sprintf "%-4s-%06d %s", ($left,$right,$extra);
  }
}
else { # fallback : see prose above
  my $string = $original;
  $string =~ s/([a-zA-Z]+)([0-9]+)/\1 \2/g; # handle both cases: letter first
  $string =~ s/([0-9]+)([a-zA-Z]+)/\1 \2/g; #                    digits first
  $string =~ s/\W/ /g;
  @tokens = split(/[\s\-\.\:\,\-\+\/\#]+/,$string);
  my @paddedTokens = map { sprintf "%4s",$_ } @tokens;
  $sortkey = join '',@paddedTokens;
}
$sortkey =~ s/ /\^/g; # add ^ to make blanks visible
$sortNum =~ s/ /\^/g; # add ^ to make blanks visible
printf "%s\t%s\t%s\t%s\n", ($original,$pattern,$sortkey,$sortNum) ;