ngram
A Crystal Shard for iterating over arbitrarily-sized and typed N-grams. Given
any Iterator(T), Ngram(T, N) will yield StaticArray(T, N) until the
Iterator is exhausted. The first element and any final element(s) of the
yielded n-grams are bumper_item T, which must be defined for any T. The
shard comes with definitions for Char, String, and any union with Nil.
This is based on my ngram_iter rust crate, which was in turn loosely inspired by the ngrams crate.
Installation
-
Add the dependency to your
shard.yml:dependencies: ngram: github: dscottboggs/ngram.cr -
Run
shards install
Usage
require "ngram"
corpus = ('a'..'z').each
bigrams = Ngram(Char, 2).new corpus
bigrams.next # => StaticArray['\u{2060}', 'a']
trigrams = Ngram(Char, 3).new corpus
trigrams.next # => StaticArray['\u{2060}', 'a', 'b']
ten_grams = Ngram(Char, 10).new corpus
ten_grams.next # => StaticArray['\u2060', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
corpus = ([1, 2] of Int32?).each
bigrams = Ngram(Int32?, 2).new corpus
bigrams.next # => StaticArray[nil, 1]
# N must be 2 or more.
NGrams(Char, 1).new ['1'] # This will fail to compile
See the tests for more details on the behavior.
Arbitrary types can be used, but an overload for e.g. type T,
bumper_item(T.class) must be implemented in the top-level namespace. Crystal
doesn't really offer a way to document this in the code, but compilation will
fail if the function overload isn't present.
record MyType, data : Int32, valid : Bool do
def self.invalid
new 0, false
end
def initialize(@data : Int32)
@valid = true
end
def to_s
if valid
data.to_s
else
"invalid"
end
end
def to_s(io : IO)
io.write to_s
end
end
def bumper_item(_type : MyType.class)
MyType.invalid
end
data = [MyType.new(1), MyType.new(2)]
ngrams = Ngram(MyType, 2).new data.each
if (ngram = ngrams.next).is_a? Iterator::Stop
raise "unreachable"
else
ngram.map(&.to_s).to_a # => ["invalid", "1"]
end
This is a bit of a contrived example, but it demonstrates the flexibility of
the shard. Of course, you can always just use a nullable type and the bumepr
will be nil, but it may make more sense to use, for example,
Float64::INFINITY, or a particular enum variant.
Contributing
- Fork it (https://github.com/dscottboggs/ngram/fork)
- Create your feature branch (
git checkout -b my-new-feature) - Commit your changes (
git commit -am 'Add some feature') - Push to the branch (
git push origin my-new-feature) - Create a new Pull Request
Contributors
- D. Scott Boggs - creator and maintainer