You are given a string, paragraph, and an array of banned words.
Write a script to return the most popular word that is not banned. It is guaranteed there is at least one word that is not banned, and that the answer is unique. The words in paragraph are case-insensitive and the answer should be in lowercase. The words cannot contain punctuation symbols.
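To make the task concrete, here is a minimal sketch in Python (the function name and the sample input are ours, for illustration only; the solutions below are in Perl, Tcl, and C):

```python
import re
from collections import Counter

def most_common_word(paragraph, banned):
    # Lower case the paragraph and extract runs of letters,
    # so punctuation is stripped from the words.
    words = re.findall(r"[a-z]+", paragraph.lower())
    counts = Counter(w for w in words if w not in set(banned))
    return counts.most_common(1)[0][0]

print(most_common_word(
    "Bob hit a ball, the hit BALL flew far after it was hit.",
    ["hit"]))  # ball
```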
The solution is pretty straightforward:
We will be reading input from standard input, each line being a separate
test case. Each line contains a //, with the paragraph preceding it,
and the list of banned words following it.
We will start off by parsing the input, separating the paragraph and the
banned words, and putting the latter into a hash (%banned):
my ($paragraph, $banned) = split m!\s*//\s*!;
my %banned = map {$_ => 1} lc ($banned) =~ /\pL+/g;
Splitting the list of banned words on white space would have been an option,
but here we extract them by searching for sequences of letters: /\pL+/g.
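The difference matters when punctuation touches a word. A small illustration in Python (using [a-z]+ after lower casing as an ASCII stand-in for Perl's \pL+):

```python
import re

text = "Bob hit a ball."
# Splitting on white space keeps punctuation attached to the word:
assert text.lower().split() == ["bob", "hit", "a", "ball."]
# Extracting runs of letters strips it:
assert re.findall(r"[a-z]+", text.lower()) == ["bob", "hit", "a", "ball"]
```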
We then initialize the hash (%ok) which we will use to tally the
unbanned words. We seed it with a sentinel entry, keyed by the empty
string (which we also store in $max), with value 0, to avoid needing
an additional variable later on:
my %ok = ((my $max = "") => 0);
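The point of the sentinel is that the final maximum scan can compare every count against $ok {$max} right away, without first checking whether a maximum has been seen. The same idea expressed in Python (illustration only, not the author's code):

```python
counts = {"": 0}                 # sentinel entry: empty string, count 0
for word in ["ball", "bob", "ball"]:
    counts[word] = counts.get(word, 0) + 1

best = ""                        # start at the sentinel key
for word in counts:
    if counts[word] > counts[best]:
        best = word
print(best)  # ball
```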
We can now extract the words from the paragraph, again using /\pL+/g,
check whether they are banned, and if not, tally them in the %ok hash.
This is a one liner:
$ok {$_} ++ for grep {!$banned {$_}} lc ($paragraph) =~ /\pL+/g;
Finally, we find the most frequently used word by iterating over the
%ok hash, and keeping track of the best seen so far:
$ok {$_} > $ok {$max} && ($max = $_) for (keys %ok);
This leaves the answer in $max.
Find the full program on GitHub.
Tcl code looks very different from Perl code, but we still use almost the
same algorithm. Here, we also start with parsing the input, but we need
a trick. Tcl does have a split command, but it can only split on single
characters, not on multi-character strings. So, we first replace the //
in the input by a NUL character (\0), then split on the NUL character:
lassign [split [string map {// \0} $line] \0] paragraph banned_str
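The same replace-then-split workaround, expressed in Python for comparison (Python's split handles multi-character separators natively, so this merely mirrors the Tcl trick):

```python
line = "Bob hit a ball // hit ball"
# Replace the two-character separator by a character that cannot
# occur in the input, then split on that single character:
paragraph, banned_str = line.replace("//", "\0").split("\0")
print(paragraph.strip())   # Bob hit a ball
print(banned_str.strip())  # hit ball
```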
In Tcl, we don't even need a split on white space: foreach treats a
string as a white-space separated list, so we can iterate over it
directly to populate the banned dictionary:
set banned [dict create]
foreach ban [string tolower $banned_str] {
dict set banned $ban 1
}
We now replace all sequences of non-letters in the paragraph by a single
space, so we're left with just the words separated by white space:
regsub -all {[^[:alpha:]]+} [string tolower $paragraph] { } paragraph
We initialize a dictionary, and populate it with a special key (the empty
string) and a count of 0, like we did in the Perl solution:
set ok [dict create {} 0]
Counting the unbanned words:
foreach word $paragraph {
if {![dict exists $banned $word]} {
dict incr ok $word
}
}
Finally, find the most frequently occurring word:
set max {}
dict for {word count} $ok {
if {$count > [dict get $ok $max]} {
set max $word
}
}
Find the full program on GitHub.
C doesn't have hashes, nor does it have much support for string
manipulation; strings are just NUL terminated arrays of chars.
We first parse the input. We will use two arrays, words and banned,
holding the lower cased words of the paragraph and the banned words,
respectively. We will use a trick: the line just read is pointed to by
the char * pointer line. Instead of copying parts of the string,
we'll just insert NUL characters after the end of each word, and put
pointers to the first characters of the words in the words and banned
arrays.
We first need to allocate memory for the words and banned arrays,
initially empty, growing them when necessary. (Note that malloc (0) may
legally return NULL, in which case this check would wrongly report a
failure; allocating a small nonzero initial size would avoid that.)
if ((words = (char **) malloc (0 * sizeof (char *))) == NULL) {
perror ("malloc failed");
exit (1);
}
if ((banned = (char **) malloc (0 * sizeof (char *))) == NULL) {
perror ("malloc failed");
exit (1);
}
We then iterate over the input, and for each character, do the following:
1. Lower case the character
2. If the current character is a letter, but the previous one wasn't,
this is the start of a word. Put it either in the words or banned
arrays, after increasing their size.
3. If the character is a /, and the next character is a / as well,
switch from parsing the paragraph to parsing the banned words.
4. If the character isn't a letter, make it a NUL character. This makes
that all the words found are proper C strings.
size_t words_c = 0;
size_t banned_c = 0;
bool parse_banned = false;
bool prev_is_letter = false;
for (char * ptr = line; * ptr; ptr ++) {
* ptr = tolower (* ptr);
if (isalpha (* ptr)) {
if (!prev_is_letter) {
if (parse_banned) {
if ((banned = (char **) realloc (banned,
++ banned_c * sizeof (char *))) == NULL) {
perror ("Realloc failed");
exit (1);
}
banned [banned_c - 1] = ptr;
}
else {
if ((words = (char **) realloc (words,
++ words_c * sizeof (char *))) == NULL) {
perror ("Realloc failed");
exit (1);
}
words [words_c - 1] = ptr;
}
}
prev_is_letter = true;
}
else {
if (!parse_banned) {
if (* ptr == '/' && * (ptr + 1) == '/') {
parse_banned = true;
}
}
* ptr = '\0';
prev_is_letter = false;
}
}
The next step is to eliminate the banned words from the words array.
Banned words are replaced by NULL. For each word, we iterate over
the banned array to see whether the word is banned.
for (size_t i = 0; i < words_c; i ++) {
for (size_t j = 0; j < banned_c; j ++) {
if (!strcmp (words [i], banned [j])) {
words [i] = NULL;
break;
}
}
}
Finally, we count how often each word occurs, keeping track of the most
frequently occurring word, using a double loop over the words array
(quadratic in the number of words, which is fine for inputs of this size).
char * max = NULL;
size_t max_count = 0;
for (size_t i = 0; i < words_c; i ++) {
size_t count = 0;
if (words [i] == NULL) {
continue;
}
for (size_t j = i; j < words_c; j ++) {
if (words [j] != NULL) {
if (!strcmp (words [i], words [j])) {
count ++;
}
}
}
if (count > max_count) {
max = words [i];
max_count = count;
}
}
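For comparison, the same quadratic skip-over-NULL scan sketched in Python, with banned words replaced by None (an illustration, not the author's C code):

```python
words = ["bob", "hit", "a", "ball", "hit", "ball"]
banned = {"hit"}
# Banned words are replaced by None, mirroring the C code:
words = [w if w not in banned else None for w in words]

best, best_count = None, 0
for i, w in enumerate(words):
    if w is None:
        continue
    # Count occurrences from position i onward; a later occurrence of
    # the same word yields a smaller count, so it never overtakes the
    # count taken at the first occurrence.
    count = sum(1 for x in words[i:] if x == w)
    if count > best_count:
        best, best_count = w, count
print(best)  # ball
```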
Find the full program on GitHub.
We also have solutions in AWK, Bash, Go, Lua, Node.js, Python, R, and Ruby.