Perl Weekly Challenge 369: Valid Tag

by Abigail

Challenge

You are given a string caption for a video.

Write a script to generate a tag for the given string caption in three steps, as mentioned below:

  1. Format as camelCase: Starting with a lower-case letter and capitalising the first letter of each subsequent word. Merge all words in the caption into a single string starting with a #.
  2. Sanitise the String: Strip out all characters that are not English letters (a-z or A-Z).
  3. Enforce Length: If the resulting string exceeds 100 characters, truncate it so it is exactly 100 characters long.

Examples

Example 1

Input: $caption = "Cooking with 5 ingredients!"
Output: "#cookingWithIngredients"

Example 2

Input: $caption = "the-last-of-the-mohicans"
Output: "#thelastofthemohicans"

Example 3

Input: $caption = "  extra spaces here"
Output: "#extraSpacesHere"

Example 4

Input: $caption = "iPhone 15 Pro Max Review"
Output: "#iphoneProMaxReview"

Example 5

Input: $caption = "Ultimate 24-Hour Challenge: Living in a Smart Home controlled entirely by Artificial Intelligence and Voice Commands in the year 2026!"
Output: "#ultimateHourChallengeLivingInASmartHomeControlledEntirelyByArtificialIntelligenceAndVoiceCommandsIn"

Solution

The examples baffled me initially. In particular, the-last-of-the-mohicans doesn't result in any capital letters (so - isn't seen as a word boundary), yet 24-Hour results in the H being capitalized, suggesting that - does break words.
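To make the mismatch concrete, here is a minimal Python sketch of one literal reading of the instructions, in the order given. It assumes that only white space separates words and that capitalising a word means upper-casing its first character, whatever that character is. The name literal_tag is made up for this illustration, and the # is added last so that the sanitise step does not strip it.

```python
import re

def literal_tag(caption: str) -> str:
    """One literal reading of the challenge: camelCase first, sanitise second."""
    # Assumption: words are separated by white space only.
    words = caption.lower().split()
    # Upper-case the first character of every word except the first word.
    camel = "".join(w if i == 0 else w[:1].upper() + w[1:]
                    for i, w in enumerate(words))
    # Sanitise only after the camelCase step, as the instructions are ordered.
    letters = re.sub(r"[^A-Za-z]", "", camel)
    return ("#" + letters)[:100]

print(literal_tag("the-last-of-the-mohicans"))    # agrees with Example 2
print(literal_tag("Ultimate 24-Hour Challenge"))  # the 'h' of hour stays lower case
```

Under this reading, the first character of 24-Hour is the digit 2, so the h is never upper-cased, which disagrees with the expected output of Example 5.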

In order to make sense of the examples, we don't follow the instructions as given, but use a different order:

  1. First remove all the non-letter, non-space characters.
  2. And only then sort out the capitalization.
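  3. Finally, join everything together, and trim the result to at most 100 characters.

With this order, 24-Hour is reduced to Hour before we look at capitalization, and since a space precedes it, the H is capitalized; the-last-of-the-mohicans is reduced to a single word, so nothing is capitalized.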
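The three steps above can be sketched in Python as follows. This is an illustration only, not the Python solution from the repository; the function name valid_tag is hypothetical, and the sketch restricts itself to ASCII letters, as the challenge text asks.

```python
def valid_tag(caption: str) -> str:
    # 1. Keep only ASCII letters and white space.
    kept = "".join(ch for ch in caption
                   if ch.isspace() or (ch.isascii() and ch.isalpha()))
    # 2. Lower-case everything, then capitalize every word but the first.
    #    split() without arguments also discards leading and trailing space.
    words = kept.lower().split()
    camel = "".join(w if i == 0 else w.capitalize()
                    for i, w in enumerate(words))
    # 3. Join behind a '#', and trim to at most 100 characters.
    return ("#" + camel)[:100]

print(valid_tag("Cooking with 5 ingredients!"))
print(valid_tag("the-last-of-the-mohicans"))
```

Because the digits and punctuation are gone before the words are split, both the 24-Hour and the the-last-of-the-mohicans examples come out as the challenge expects.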

Perl

For our Perl solution, we do all the modifications using regular expressions, which we chain using the /r modifier.

We will do the following steps, in order:

  1. Remove all non-space, non-letter characters (s/[^\s\pL]+//gr)
  2. Remove leading white space (s/^\s+//r)
  3. Remove trailing white space (s/\s+$//r)
  4. Lower case the entire string (s/(.*)/\L$1/r)
  5. Capitalize each letter which follows white space, and delete that white space (s/\s+(\pL)/\u$1/gr)
  6. Add a leading # (s/^/#/r)
  7. Keep the first 100 characters, and delete the rest (s/.{100}\K.*//r).

This leads to the following program, where $_ contains the line we're processing:

say s/[^\s\pL]+//gr      =~   # Remove non-letters, but keep the spaces
    s/^\s+//r            =~   # Remove leading spaces
    s/\s+$//r            =~   # Remove trailing spaces
    s/(.*)/\L$1/r        =~   # All lower case
    s/\s+(\pL)/\u$1/gr   =~   # Capitalize first letter of each word,
                              # and delete the space preceding it
    s/^/#/r              =~   # Add leading '#'
    s/.{100}\K.*//r           # Remove all characters after the 100th.

Find the full program on GitHub.
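For readers who don't know Perl, the same chain of seven substitutions might look like this in Python. This is a rough translation for illustration, not the author's Python solution. Python's re module has no \pL, \u, \L or \K, so the sketch assumes [A-Za-z] for letters, uses a replacement function for the capitalization, and uses a capture group for the truncation. The name tag_by_substitution is hypothetical.

```python
import re

def tag_by_substitution(line: str) -> str:
    s = re.sub(r"[^\sA-Za-z]+", "", line)       # 1. remove non-letters, keep spaces
    s = re.sub(r"^\s+", "", s)                  # 2. remove leading white space
    s = re.sub(r"\s+$", "", s)                  # 3. remove trailing white space
    s = s.lower()                               # 4. lower case everything (\L$1)
    s = re.sub(r"\s+([a-z])",                   # 5. upper case a letter following
               lambda m: m.group(1).upper(), s) #    white space, drop the space
    s = "#" + s                                 # 6. add the leading '#'
    return re.sub(r"(.{100}).*", r"\1", s)      # 7. keep the first 100 characters

print(tag_by_substitution("iPhone 15 Pro Max Review"))
```

In step 7, the pattern only matches if there are at least 100 characters, so shorter strings pass through unchanged, just as with the \K version.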

Tcl

Our solutions in most other languages follow the same steps as our Perl solution, but many languages don't have an equivalent to s/\s+(\pL)/\u$1/gr. Tcl does have a string totitle command, which is similar to Perl's ucfirst.

This leads to the following program, where $input contains the input:

# * Remove non-letters, but keep space
# * Remove leading white space
# * Remove trailing white space
#
regsub -all          {[^[:alpha:]\s]+} $input {}               input
regsub               {^\s+}            $input {}               input     
regsub               {\s+$}            $input {}               input

#
# Lower case entire string
#
set input [string tolower $input]

#
# * For each sequence of letters (a word), upper case its first letter.
#   This upper cases the first letter of the string, hence
# * Lower case the first letter
# * Remove all white space
#
regsub -all -command {[[:alpha:]]+}    $input {string totitle} input
regsub      -command {^[[:alpha:]]}    $input {string tolower} input
regsub -all          {\s+}             $input {}               input

#
# Add leading '#', and print at most 100 characters
#
puts [string range "#${input}" 0 99]

The -command switch means the substitution part is treated as a command to be executed (like the /e modifier in Perl), with the matched part implicitly passed in as its argument; the result of the command replaces the match.

Find the full program on GitHub.
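The idea behind -command, calling a piece of code for each match and substituting its result, exists in other languages as well. As a small sketch of the same title-case-then-fix-the-first-letter trick: Python's re.sub accepts a function as the replacement, although it receives a match object rather than the matched string. The name camel_via_callback is made up for this illustration, and the input is assumed to be already lower-cased and stripped of non-letters.

```python
import re

def camel_via_callback(text: str) -> str:
    # Title-case every run of letters; the callback's return value
    # replaces the matched text.
    titled = re.sub(r"[A-Za-z]+", lambda m: m.group(0).capitalize(), text)
    # This also upper-cased the first letter of the string, so undo that.
    titled = re.sub(r"^[A-Za-z]", lambda m: m.group(0).lower(), titled)
    # Finally, remove all white space.
    return re.sub(r"\s+", "", titled)

print(camel_via_callback("extra spaces here"))   # extraSpacesHere
```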

C

C doesn't come with much functionality to manipulate strings. So, we will iterate over the string, skipping over the non-letters, and keeping track of whether we have seen a space since the last letter.

Given the input in line, we use the following program:

# include <stdlib.h>
# include <stdio.h>
# include <stdbool.h>
# include <ctype.h>

char * ptr     = line;
bool saw_space = false;
short chars    = 0;

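/*
 * Skip leading spaces, and other non-letters
 */
while (* ptr && !isalpha (* ptr)) {ptr ++;}

/*
 * Print the leading #, and the first letter, if there is one.
 * (Without the check, a line without any letters would make
 * us step past the end of the string.)
 */
printf ("#");
chars ++;
if (* ptr) {
    printf ("%c", tolower (* ptr ++));
    chars ++;
}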
while (* ptr && chars < MAX_CHARS) {
    char ch = * ptr ++;
    if (isspace (ch)) {
        /*
         * If we saw a space, skip it, but remember we did see it
         */
        saw_space = true;
        continue;
    }
    if (!isalpha (ch)) {
        /*
         * If it's not a letter, skip it. Don't modify the
         * saw_space status
         */
        continue;
    }
    /*
     * We now have a letter. If we saw a space, print its
     * upper case, else print its lower case.
     * The saw_space status will be turned off
     */
    printf ("%c", saw_space ? toupper (ch) : tolower (ch));
    saw_space = false;
    chars ++;
}
printf ("\n");

Find the full program on GitHub.

Other Languages

We also have solutions in AWK, Bash, Go, Lua, Node.js, Python, R, Ruby and sed, all using more or less the same steps as our Perl and Tcl solutions.


Please leave any comments as a GitHub issue.