There’s a website called HackerRank that lets you challenge your coding skills (well, more your knowledge of certain techniques) and compete against others. I thought it would be a good way to brush up on some skills and continue learning. So down the rabbit hole I went.
Here’s the problem for anyone who wishes to play along at home. The problem asks you to search for tens of thousands of keywords in strings that can be long or short. Each time a keyword is found, even as part of another keyword, you add its points (which change between searches), then report the minimum and maximum totals across all of the searches. Seems easy, right?
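To see why the problem is harder than it looks, consider the obvious baseline. This is an illustrative sketch (the names naiveScore and countOverlapping are mine, not from the solution): scan the string once per active keyword and sum the points, counting overlapping occurrences. It is O(queries × keywords × text length), which is exactly what times out.

```go
package main

import (
	"fmt"
	"strings"
)

// countOverlapping counts every occurrence of pat in s,
// including overlapping ones ("aa" appears twice in "aaa").
func countOverlapping(s, pat string) int {
	n := 0
	for i := 0; ; {
		j := strings.Index(s[i:], pat)
		if j < 0 {
			return n
		}
		n++
		i += j + 1 // advance only one byte so overlaps are counted
	}
}

// naiveScore sums health[k] * occurrences(genes[k]) for k in [first, last].
func naiveScore(genes []string, health []int64, first, last int, dna string) int64 {
	var score int64
	for k := first; k <= last; k++ {
		score += health[k] * int64(countOverlapping(dna, genes[k]))
	}
	return score
}

func main() {
	genes := []string{"a", "aa", "b"}
	health := []int64{1, 2, 3}
	// In "aab": "a" twice, "aa" once, "b" once -> 1*2 + 2*1 + 3*1 = 7
	fmt.Println(naiveScore(genes, health, 0, 2, "aab"))
}
```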
Without looking at the comments, I originally concluded I needed to use tries, but then realized it was more complicated than that. My first attempt was a home-rolled solution: a single trie with all the matches configured. Every step through the input string added a state to a list. This state list essentially operated the trie at multiple positions in the same input, creating a branched state machine whose branches all ran in sequence (though they could run in parallel) but independently. I thought it was clever, and it might still be. The solution in the next section is good, but it’s not multi-threadable.
After spending quite a bit of time getting it to work, the algorithm still didn’t complete the tests in time, so I started reading the HackerRank discussions to see what other people were saying. Some comments mentioned “AC”, which, after reading the older comments, I came to understand was an acronym for “Aho-Corasick”: the last names of two engineers at Bell Labs in 1975.
I had never heard of this algorithm before. So I Googled it, and boy was I supremely disappointed.
There are lots of terrible explanations of the algorithm online (I’m looking at you, GeeksforGeeks). Most have dense explanations and examples that don’t match the explanations, or that include fields that are never used. I found only one decent article, from Toptal, which explained the KMP search algorithm as background and then explained the AC algorithm, but not overly well. So I tried Googling again.
Success! I found a paper from the ACM here for those who want to read it: “Efficient String Matching: An Aid to Bibliographic Search” (1975) by Alfred V. Aho and Margaret J. Corasick at Bell Laboratories. It’s remarkably easy to read, and the algorithm is easy to understand from it. I’m stymied by how the articles attempting to explain the original work fail so badly to add any clarity to the algorithm at all. They should have just linked to the paper!
I spent about 4-5 hours reading it very carefully and studying the proofs by induction of why the longest-suffix match correctly makes progress. My original hunch had been correct; I just hadn’t been able to formulate a proof of why my original idea worked, and this was it.
While taking my Algorithms course at USC, I found the material extremely difficult. The professor was a nice guy, and proofs by induction are difficult to understand, but once you formulate the arguments, they’re mathematically air-tight. I’m a bit disappointed that I haven’t really encountered algorithmic proofs in my line of work so far. They seem really important for making sure your complex state machine does what you actually intend it to do. But, I digress.
I had originally started writing the solution to the DNA Health question in Java (HackerRank supports Java 8). I got a working solution with AC. However, I got timeouts on most of the test cases. They limited the execution time and picked test cases that would run over the limit if you did it wrong?! Shenanigans!
Just kidding, it was still a good challenge. I ran a profiler in my favorite IDE, IntelliJ, and found that Java was spending 33% of its time doing garbage collection. I removed a HashMap I was using to cache pre-constructed AC state machines, reducing the GC time to just 20% of the application’s sampled run time. That’s still hot garbage, but it was an improvement. The sample case Input07 was still taking about 2 minutes 30 seconds to run on my local machine.
I then switched to GoLang. I implemented a version of AC and posted it to my GitHub: go-aho-corasick-search. This reduced the search time for Input07 to 30 seconds. I ran the profiler and about 50% of the time was spent doing garbage collection. This was a disappointing result, as I had spent a great deal of effort in the implementation avoiding heap-allocated data structures wherever possible. When allocation did occur, it was in slices. Nearly everything is a copy or a stack-based reference to a copy.
As for the other 50% of the time, it was mostly spent performing memory allocations for the tries’ output lists. This was the only structure whose size I could not predict in advance of creating the tries.
Of this 50%, only about 4-5% of the time was actually spent traversing the state machine. The issue was that the trie-construction process was taking an inordinate amount of time: I was creating state machines for each pass, and with some very basic tricks that was not necessary.
Originally, I had created a new state machine for each subset of match inputs, so the program spent all of its time building state machines and mopping up memory. I peeked at the Editorial; it was not very helpful, and the code is nearly inscrutable. One thing I did pick up on was that the author wrote the entire thing in C++ as a single static state machine and re-used every structure. The author spent no time allocating structures.
With this knowledge, I stopped creating multiple state machines and just made one. The basic trick was to filter out any outputs that weren’t in the anticipated gene set, instead of creating a unique state machine for each test. Tries are very efficient little state machines.
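The single-machine trick can be sketched in isolation (illustrative names, not the solution’s types): build one machine over the whole keyword set, run it once per query, and filter each match down to the query’s [first, last] gene window at scoring time instead of rebuilding the trie.

```go
package main

import "fmt"

// scoreInRange keeps only matches whose keyword index falls inside the
// query's [first, last] window; matches outside the window cost nothing
// beyond this comparison, which is far cheaper than a trie rebuild.
func scoreInRange(matches []int, health []int64, first, last int) int64 {
	var score int64
	for _, k := range matches {
		if first <= k && k <= last {
			score += health[k]
		}
	}
	return score
}

func main() {
	// Matches reported by one shared machine over the full gene set...
	matches := []int{0, 2, 2, 3}
	health := []int64{10, 20, 30, 40}
	// ...scored against two different active windows, no rebuild needed.
	fmt.Println(scoreInRange(matches, health, 0, 3)) // 10+30+30+40 = 110
	fmt.Println(scoreInRange(matches, health, 2, 3)) // 30+30+40 = 100
}
```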
However, after running my profiler again, I noticed large amounts of time being consumed in the Emit method of my search algorithm. It was adding many results, and that would sometimes create a new slice, which needed all the previous values copied into it.
The solution was to pre-allocate this buffer to be large enough to hold the outputs of every test case. So I guessed: I picked progressively larger numbers for my pre-allocated slice sizes until Go was no longer performing grow operations (resizing the slice and copying the old values into the new one). For the output list of each node in the trie, this was 5. I had experimented with larger numbers, but locked up my local machine when I ran out of memory. The output array, however, was the magic one.
I incrementally tried higher and higher values (100, 500, 1,000, 10,000, 50,000) until it stopped allocating memory at 1,000,000 for sample 13, the most vexing test case.
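The effect of pre-allocation is easy to demonstrate. This small sketch (countGrows is an illustrative helper, not from the solution) counts how many times append has to reallocate: growing from zero capacity reallocates roughly log2(n) times, copying the accumulated elements each time, while pre-allocating the worst case up front reallocates never.

```go
package main

import "fmt"

// countGrows appends n items to a slice pre-allocated with the given
// capacity and counts how many times append had to reallocate
// (observable as a change in cap).
func countGrows(capacity, n int) int {
	s := make([]int, 0, capacity)
	grows := 0
	lastCap := cap(s)
	for i := 0; i < n; i++ {
		s = append(s, i)
		if cap(s) != lastCap {
			grows++
			lastCap = cap(s)
		}
	}
	return grows
}

func main() {
	// Growing from nil: many grow-and-copy cycles.
	fmt.Println(countGrows(0, 1_000_000) > 0) // true
	// Pre-allocated to the worst case: zero grow-and-copy cycles.
	fmt.Println(countGrows(1_000_000, 1_000_000)) // 0
}
```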
I refactored to re-use every structure, and run times are down to 2 seconds locally. All but test case 13 passed; it failed with a timeout error. It runs in about 3.18 seconds locally with the CPU profiler enabled, which was still not enough to pass HackerRank’s time limit. Input 13 is the only one that did not pass in time. Failing 1 out of 31 isn’t bad.
Pro-tip: you need to shave every single second off of this one. Pull out every optimization trick you have. Don’t make it readable, make it fast. Pre-allocate every buffer and output variable (and some will need to be quite massive).
Heads up: the boilerplate provided for GoLang is broken. strings.Split will only read up to 1,025 entries by default. You need to completely replace the input handling with something else. I used a bufio.Reader with Go’s built-in text/scanner package. It’s not only much faster, but more memory-efficient as well.
Here’s my solution. It’s not pretty, and I hacked up the GitHub library to experiment with optimizations.
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
	"math"
	"os"
	"strconv"
	"text/scanner"
)
type stateFifo struct {
	items   []stateIndex
	readPos int
}

func (f *stateFifo) Push(i stateIndex) {
	if f.items == nil {
		f.items = make([]stateIndex, 0, 10)
	}
	f.items = append(f.items, i)
}

func (f stateFifo) Peek() (i stateIndex, ok bool) {
	if f.items == nil {
		return -1, false
	}
	if f.IsEmpty() {
		return -1, false
	}
	return f.items[f.readPos], true
}

func (f *stateFifo) Pop() {
	f.readPos++
	if f.IsEmpty() {
		f.Reset()
	}
}

func (f stateFifo) IsEmpty() bool {
	return f.readPos >= len(f.items)
}

func (f *stateFifo) Reset() {
	f.readPos = 0
	f.items = f.items[0:0]
}
type InvalidCharsetError struct {
	Char rune
}

func newInvalidCharsetError(char rune) error {
	return &InvalidCharsetError{
		Char: char,
	}
}

func (e *InvalidCharsetError) Error() string {
	return fmt.Sprintf("encountered rune: '%c' but it is not supported by this machine", e.Char)
}
type stateIndex int64

type vertexer interface {
	nextState(edge int64) stateIndex
	setNextState(edge int64, si stateIndex)
	setInvalidEdgesTo(si stateIndex)
}

type vertex struct {
	failState stateIndex
	output    []int
}

func newVertex() vertex {
	return vertex{
		failState: invalidState,
		output:    nil,
	}
}

func (v *vertex) setFailState(state stateIndex) {
	v.failState = state
}

func (v *vertex) appendOutputIndex(i []int) {
	if v.output == nil {
		v.output = make([]int, 0, 5)
	}
	v.output = append(v.output, i...)
}

func (v *vertex) outputs() []int {
	if v.output == nil {
		return make([]int, 0, 0)
	}
	return v.output
}

func (v vertex) hasOutput() bool {
	return v.output != nil && len(v.output) != 0
}
type vertexDense struct {
	vertex
	edges []stateIndex
}

func newVertexDense(numberOfStates int) vertexDense {
	v := vertexDense{
		vertex: newVertex(),
		edges:  make([]stateIndex, numberOfStates),
	}
	// By default, all edges are invalid
	for i := range v.edges {
		v.edges[i] = invalidState
	}
	return v
}

func (v *vertexDense) nextState(edge int64) (next stateIndex, ok bool) {
	next = v.edges[edge]
	return next, next != invalidState
}

func (v *vertexDense) setNextState(edge int64, si stateIndex) {
	v.edges[edge] = si
}

func (v *vertexDense) setInvalidEdgesTo(si stateIndex) {
	for stateIndex, state := range v.edges {
		if state == invalidState {
			v.edges[stateIndex] = si
		}
	}
}

type denseStates []vertexDense

func newDenseStates() denseStates {
	return make([]vertexDense, 0, 10)
}

func (s denseStates) lastStateIndex() stateIndex {
	return stateIndex(len(s))
}

const (
	startState   stateIndex = 0
	invalidState stateIndex = -1
)
type LowerLetters struct {
	states denseStates
}

type LowerLettersBuilder struct {
	states        denseStates
	keywordsSoFar int
}

const (
	letterStates = 26
)

func NewLowerLetters() (m LowerLettersBuilder) {
	m = LowerLettersBuilder{}
	m.states = newDenseStates()
	m.states = append(m.states, newVertexDense(letterStates))
	return
}

func (m *LowerLettersBuilder) AddKeyword(keyword string) (err error) {
	m.states, err = addKeywordToDenseTrie(m.states, m.keywordsSoFar, keyword)
	m.keywordsSoFar++
	return
}

func (m *LowerLettersBuilder) Build() LowerLetters {
	l := LowerLetters{
		states: m.states,
	}
	m.states = nil
	// Set all of the start-state edges that don't match back to the start
	l.states[startState].setInvalidEdgesTo(startState)
	l.states = buildDenseFails(l.states)
	return l
}
func addKeywordToDenseTrie(statesIn denseStates, keywordIndex int, keyword string) (states denseStates, err error) {
	states = statesIn
	currentState := startState
	letterIndex := 0
	for ; letterIndex < len(keyword); letterIndex++ {
		letter := keyword[letterIndex]
		if !isLowerAscii(letter) {
			return states, newInvalidCharsetError(rune(letter))
		}
		letterStateIndex := letter - 'a'
		nextState, ok := states[currentState].nextState(int64(letterStateIndex))
		if !ok {
			break
		}
		currentState = nextState
	}
	for ; letterIndex < len(keyword); letterIndex++ {
		letter := keyword[letterIndex]
		letterStateIndex := letter - 'a'
		v := newVertexDense(letterStates)
		states[currentState].setNextState(int64(letterStateIndex), states.lastStateIndex())
		currentState = states.lastStateIndex()
		states = append(states, v)
	}
	states[currentState].appendOutputIndex([]int{keywordIndex})
	return
}

func isLowerAscii(r uint8) bool {
	return 'a' <= r && r <= 'z'
}
func buildDenseFails(statesIn denseStates) (states denseStates) {
	states = statesIn
	start := states[startState]
	q := stateFifo{}
	// initialize the denseStates at depth 1
	for _, state := range start.edges {
		if state != startState {
			// All failure denseStates at depth 1 go back to the start state
			states[state].failState = startState
			q.Push(state)
		}
	}
	for !q.IsEmpty() {
		statePreviousDepth, _ := q.Peek()
		q.Pop()
		for letterIndex, s := range states[statePreviousDepth].edges {
			if s != invalidState {
				q.Push(s)
				state := states[statePreviousDepth].failState
				_, ok := states[state].nextState(int64(letterIndex))
				for !ok {
					state = states[state].failState
					_, ok = states[state].nextState(int64(letterIndex))
				}
				// Found a failState that is not invalid;
				// set the nextState fail state to the entry for this letter
				states[s].failState, _ = states[state].nextState(int64(letterIndex))
				states[s].appendOutputIndex(states[states[s].failState].output)
			}
		}
	}
	return
}
type Output struct {
	KeywordIndex    int
	CharacterOffset uint64
	ByteOffset      uint64
}

type Writer interface {
	Emit(out Output)
	io.Closer
}

type Reader interface {
	Next() (out Output, ok bool)
}

type ReadWriter interface {
	Reader
	Writer
}
func (m LowerLetters) Search(input io.Reader, results *sync) (err error) {
	var bytesReadSoFar uint64
	currentState := startState
	letter := []byte{0}
	defer func() {
		_ = results.Close()
	}()
	for {
		_, err = input.Read(letter)
		if err != nil {
			if isEOFOrClosed(err) {
				err = nil
			}
			return
		}
		bytesReadSoFar++
		letterIndex := letter[0] - 'a'
		_, ok := m.states[currentState].nextState(int64(letterIndex))
		for !ok {
			currentState = m.states[currentState].failState
			_, ok = m.states[currentState].nextState(int64(letterIndex))
		}
		currentState, _ = m.states[currentState].nextState(int64(letterIndex))
		if m.states[currentState].hasOutput() {
			outputs := m.states[currentState].outputs()
			for i := range outputs {
				results.outputs[results.currentWriteIndex] = Output{
					KeywordIndex:    outputs[i],
					ByteOffset:      bytesReadSoFar,
					CharacterOffset: bytesReadSoFar,
				}
				results.currentWriteIndex++
			}
		}
	}
}
func isEOFOrClosed(err error) bool {
	if errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrClosedPipe) {
		return true
	}
	if pathErr, ok := err.(*os.PathError); ok {
		if pathErr.Err.Error() == "file already closed" {
			return true
		}
	}
	return false
}
// sync is a result bucket built without Search running in a goroutine.
// It is not thread-safe and is intended for small, test jobs.
type sync struct {
	outputs           []Output
	currentReadIndex  int
	currentWriteIndex int
}

func NewSync(bufferSize int) ReadWriter {
	return &sync{
		outputs: make([]Output, bufferSize),
	}
}

func (r *sync) Reset() {
	r.currentReadIndex = 0
	r.currentWriteIndex = 0
}

func (r *sync) Next() (out Output, ok bool) {
	if r.currentReadIndex >= len(r.outputs) {
		return out, false
	}
	out = r.outputs[r.currentReadIndex]
	r.currentReadIndex++
	return out, true
}

func (r *sync) Emit(out Output) {
	r.outputs[r.currentWriteIndex] = out
	r.currentWriteIndex++
}

func (r *sync) Close() error {
	return nil
}
func main() {
	reader := bufio.NewReaderSize(os.Stdin, 1024*1024)
	sc := scanner.Scanner{}
	sc.Init(reader)
	sc.Scan()
	nTemp, err := strconv.ParseInt(sc.TokenText(), 10, 64)
	checkError(err)
	n := int32(nTemp)
	sb := NewLowerLetters()
	genes := make([]string, n)
	for i := 0; i < int(n); i++ {
		sc.Scan()
		genes[i] = sc.TokenText()
		_ = sb.AddKeyword(genes[i])
	}
	health := make([]int32, n)
	for i := 0; i < int(n); i++ {
		sc.Scan()
		healthItemTemp, err := strconv.ParseInt(sc.TokenText(), 10, 64)
		checkError(err)
		health[i] = int32(healthItemTemp)
	}
	sc.Scan()
	sTemp, err := strconv.ParseInt(sc.TokenText(), 10, 64)
	checkError(err)
	s := int32(sTemp)
	m := sb.Build()
	res := NewSync(1000000).(*sync)
	var min int64 = math.MaxInt64
	var max int64
	var score int64
	var match Output
	for sItr := 0; sItr < int(s); sItr++ {
		score = 0
		sc.Scan()
		firstTemp, err := strconv.ParseInt(sc.TokenText(), 10, 64)
		checkError(err)
		first := int32(firstTemp)
		sc.Scan()
		lastTemp, err := strconv.ParseInt(sc.TokenText(), 10, 64)
		checkError(err)
		last := int32(lastTemp)
		sc.Scan()
		d := sc.TokenText()
		res.Reset()
		_ = m.Search(bytes.NewBufferString(d), res)
		for i := 0; i < res.currentWriteIndex; i++ {
			match = res.outputs[i]
			if first <= int32(match.KeywordIndex) && int32(match.KeywordIndex) <= last {
				score += int64(health[match.KeywordIndex])
			}
		}
		if score > max {
			max = score
		}
		if score < min {
			min = score
		}
	}
	fmt.Printf("%d %d", min, max)
}

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}
I do think this is a difficult problem, even if you know that AC is the best, if not the only, way to solve it. It’s more in line with what would be required for a data-analyst position: writing something that is extremely fast and efficient.
However, for most business-logic programming, you don’t want to write something that uses the tricks required to reach this level of optimization. The code is not very readable or compartmentalized. Even in Go, the encapsulation really slowed a few things down; even calling a method that appends a small struct (1 int, 2 int64s) to a slice was slow.
This was fun, but probably not a realistic problem for most software engineers (unless you’re doing search engines or data analysis). If you see this as a job interview question, run.
So you didn’t finish it?
Nope. Not yet. Input 13, my enemy.
“Life, encoded” Pascal via Flickr. Feb 2, 2015. License: Public Domain.