Perl Weekly Challenge 135: Validate SEDOL

by Abigail

Challenge

You are given 7-characters alphanumeric SEDOL.

Write a script to validate the given SEDOL. Print 1 if it is a valid SEDOL otherwise 0.

For more information about SEDOL, please checkout the wikipedia page.

Example 1

Input: $SEDOL = '2936921'
Output: 1

Example 2

Input: $SEDOL = '1234567'
Output: 0

Example 3

Input: $SEDOL = 'B0YBKL9'
Output: 1

Discussion

The SEDOL Wikipedia page explains SEDOL codes consists of six alpha numerical characters, followed by a check digit. Although not explicitly mentioned, all the examples, including the example code, assume the letters are capitals. Vowels are not used. (Vowels in this context means the letters A, E, I, O, and U; Y is not a vowel here.)

To calculate the checksum, we first have to map each character to a value. Digits map to themselves; letters map consecutively, with A mapping to 10, up to Z mapping to 36. For mapping purposes, we do not skip the vowels. We then multiply values we got after mapping the characters by different weights, depending on their position:

A SEDOL code is valid if the sum of the multiplied values is evenly divisible by 10.

Solutions

Note that since we are only interested in having a check sum which is divisible by 10, everything will also work out if we map A to Z to 0 to 25. We will do this for most solutions.

Each solution will read SEDOL codes from standard input, one code per line, outputting a 0 or 1 for each code, depending whether the code is valid or not.

Perl

We first start off with creating a hash which we can use to map characters to values:

my %c = do {my $c = 0; map {$_ => $c ++} 0 .. 9, 'A' .. 'Z'};

Given that, the rest is just a (long) one-liner:

say /^((?[[0-9A-Z]-[AEIOU]]))@{["((?1))" x 5]}([0-9])$/x &&
($c {$1} + 3 * $c {$2} + $c {$3} + 7 * $c {$4} + 3 * $c {$5} + 9 * $c {$6} +
 $c {$7}) % 10 == 0 || 0 while <>;

Let's have a closer look at the regexp. It starts off with:

/^                 # Match at the beginning of the string
 (                 # Start of capture group 1
   (?[             # Perform character class arithmetic
     [0-9A-Z]      # Character class matching a digit, or an uppercase letter
        -          # ... except ...
     [AEIOU]       # vowels
   ])              # End of character class arithmetic
 )                 # End of capture group 1

So this matches the first character of a SEDOL code, and, if the entire pattern matches, it makes the first character available in $1.

The next part of the pattern is:

@{["((?1))" x 5]}

What is going on here? First, we have the baby cart operator, @{[ ]}. It evaluates the expression inside the brackets (in list context), and interpolates the result in the string (or regexp in this case).

Inside the brackets, we have:

"((?1))" x 5

This is the repetition operator (x). It takes the string on the left of it, and repeats that as many times as the value on its right. So, this results in fives times ((?1)).

Since this is interpolated in the regexp. Now, what does ((?1)) mean in a regexp? The outer parenthesis are capturing parethesis. So this is capturing $2, $3, $4, $5, and $6.

But what is (?1)? This is a recursive subpattern. It means to repeat the same pattern as in the first capture group. So, all this means we're capturing the next five characters, as long as they are valid SEDOL characters (else, the entire match will fail).

The final part of the regexp is simple:

([0-9])$

It matches a digit, followed by the end of the string. The matched digit is going to be placed in $7.

The regular expression matches if and only if it is matched against a syntactical valid SEDOL (that is, exactly seven characters, all of the allowed alpha numerics). And if it matches, each of its seven characters are available in $1 to $7.

We can now easily calculate the checksum:

    $c {$1} + 3 * $c {$2} + $c {$3} + 7 * $c {$4} +
3 * $c {$5} + 9 * $c {$6} + $c {$7}

which we then check if it is divisible by 10: % 10 == 0.

Find the full program on GitHub.

AWK

First we create an array w with weights, and an array c which maps characters to values:

BEGIN {
    ord_0 = 48
    ord_9 = 57
    ord_A = 65
    ord_Z = 90
    split ("1 3 1 7 3 9 1", w, " ")
    for (i = ord_0; i <= ord_9; i ++) {
        c [sprintf ("%c", i)] = i - ord_0
    }
    for (i = ord_A; i <= ord_Z; i ++) {
        c [sprintf ("%c", i)] = i - ord_A
    }
}

Then we do some validation: each character should be a digit, or an upper case letters excluding vowels, the string should be exactly seven characters long, and the string should end with a digit:

/[^0-9BCDFGHJKLMNPQRSTVWXYZ]/ {print 0; next}
length ($0) != 7              {print 0; next}
/[^0-9]$/                     {print 0; next}

Now we calculate the check sum, and verify whether it is divisible by 10:

{
    check = 0
    for (i = 1; i <= 7; i ++) {
        check += w [i] * c [substr ($0, i, 1)]
    }
    print (check % 10 == 0 ? 1 : 0)
}

Find the full program on GitHub.

Bash

First, we create an array with weights, and some constants:

w=(1 3 1 7 3 9 1)

printf -v ord_0 %d "'0"
printf -v ord_A %d "'A"

With the input line in line, we can first so some basic validation:

if ((${#line} != 7))
then echo 0
elif [[ $line =~ [^0-9BCDFGHJKLMNPQRSTVWXYZ] ]]
then echo 0
elif [[ $line =~ [^0-9]$ ]]
then echo 0
else

If this all validates, we can calculate the check sum, and validate it:

((check = 0))
for ((i = 0; i < 7; i ++))
do  char=${line:$i:1}
    printf -v ord %d "'$char"
    if [[ $char =~ [0-9] ]]
    then ((value = ord - ord_0))
    else ((value = ord - ord_A))
    fi
    ((check += ${w[i]} * value))
done
if ((check % 10 == 0))
then echo 1
else echo 0
fi

Find the full program on GitHub.

C

An array of weights:

short w [] = {1, 3, 1, 7, 3, 9, 1};

We won't be using regular expressions in our C solution; instead we scan over the string checking each character.

Here, we have the input line in the variable line, and its length in a variable str_len. The line includes a newline.

char * line_ptr = line;
short valid = 1;
if (str_len == 8) {
    int check = 0;
    for (size_t i = 0; i < 7 && valid; i ++) {
        char first = 0;
        if ('0' <= line_ptr [i] && line_ptr [i] <= '9') {
            first = '0';
        }
        else {
            if ('B' <= line_ptr [i] && line_ptr [i] <= 'Z' &&
                       line_ptr [i] != 'E'                 &&
                       line_ptr [i] != 'I'                 &&
                       line_ptr [i] != 'O'                 &&
                       line_ptr [i] != 'U'                 &&
                       i < 6) {
                first = 'A';
            }
            else {
                valid = 0;
            }
        }
        check += (line_ptr [i] - first) * w [i];
    }
    if (check % 10 != 0) {
        valid = 0;
    }
}
else {
    valid = 0;
}
printf ("%d\n", valid);

Find the full program on GitHub.

Go

An array of weights:

var w = [] rune {1, 3, 1, 7, 3, 9, 1}

In Go a rune is a type for (Unicode) character values.

We pick up validation when text contains the line of input, trimmed of its newline. First, we check for syntax:

match, _ := regexp . MatchString (
    "^[0-9BCDFGHJKLMNPQRSTVWXYZ]{6}[0-9]$", text)

This validates if text contains 6 alpha-nums (without vowels), followed by a digit; else we fail and continue:

if !match {
    fmt . Print ("0\n")
    continue
}

Calculating and validating the check sum:

var check rune = 0
for i, rune := range text {
    if rune <= '9' {
        rune -= '0'
    } else {
        rune -= 'A'
    }
    check += w [i] * rune
}
if check % 10 == 0 {
    fmt . Print ("1\n")
} else {
    fmt . Print ("0\n")
}

Find the full program on GitHub.

Scheme

For our scheme solution, we start off with a function which takes a byte (in effect, an ASCII value), and which returns the value it is mapped to:

(define ords  (bytevector->u8-list (string->bytevector "09A" "UTF-8")))
(define ord-0 (list-ref ords 0))
(define ord-9 (list-ref ords 1))
(define ord-A (list-ref ords 2))
(define (byte->val b)
    (if (<= b ord-9)
        (-  b ord-0)
        (-  b ord-A)))

We start off with the three character string 09A, which we then turn into a bytevector in UTF-8 encoding. This is a superset of ASCII, and the three characters in the string are all ASCII characters.

string->bytevector turns the string into a bytevector, while bytevector->u8-list takes this bytevector and turns it into a list. This list is placed in the variable ords.

Next is setting three constants, ord-0, ord-9, and ord-A; these will hold the ASCII values for 0, 9, and A. list-ref is a function which takes two arguments, a list and an index. It will return the element on said index.

Finally, we define a function byte->val which takes a byte as argument (b), and returns the mapping value. If the byte is the value of a digit (<= b ord-9), we subtract the value of the digit 0: (- b ord-0), else we subtract the value of the letter A: (- b ord-A).

Defining an array of weights is easy:

(define w (list 1 3 1 7 3 9 1))

Next, we need a bunch of variables:

(define pat "^[0-9BCDFGHJKLMNPQRSTVWXYZ]{6}[0-9]$") ;; Pattern to match with
(define is-sedol)                                   ;; Match object
(define valid 0)                                    ;; Is the SEDOL valid?
(define check 0)                                    ;; Checksum
(define values)                                     ;; mapped values of SEDOL
(define sedol (read-line))                          ;; Line of input

We start with matching the input against the pattern. If it fails, it can never be a valid SEDOL:

(set! is-sedol (string-match pat sedol))

If the match succeeds, we will take the characters, map them to their ASCII values, and then, using the byte->val function above, calculate their mapped values:

(if (regexp-match? is-sedol)
    (begin
        (set! values
            (map-in-order byte->val
                (bytevector->u8-list
                    (string->bytevector
                        (match:substring is-sedol 0) "UTF-8"))))

match:substring returns a capture of the match, with the zeroth capture being the entire match. (This is how we get rid of the newline which is in sedol). string->bytevector and bytevector->u8-list gives us a list of ASCII values (similar to what is explained above). map-in-order takes two arguments, a function and a list: it applies the function to each element of the list. Finally, the set! assigns this to values.

We can now calculate the checksum without an explicit loop:

(set! check
    (fold (lambda (weight val sum) (+ sum (* weight val))) 0 w values)
)

lambda creates an anonymous function. Here we have a function taking three arguments. It will multiply the first two, and add them to the third.

fold takes three or more arguments: a function, an initial value, and a list of lists. It will repeatedly call the function, with the next elements of the lists as parameters. The last parameter is the return value of the previous invocation of the function (and the initial value the first time it is called). The return value of the last invocation if the final value.

Now that we have the checksum, we can test whether it is divisible by 10. If it does, the input is a valid SEDOL:

(if (= 0 (modulo check 10)) (set! valid 1))

Printing the result is easy:

(display valid)(newline)

Find the full program on GitHub.

Other languages

We also have solutions in other languages, solutions which are very similar to one of the other solutions: Java, Lua, Node.js, Python, Ruby, and Tcl.


Please leave any comments as a GitHub issue.