Recently I implemented a dictionary (as in – look up a word's meaning in a dictionary) for use in Natural Language Processing. As this functionality was likely to be used with most words, I needed the lookup to be rapid. I found an open source text file to use as a source for the dictionary data itself, but now needed a way to rapidly access any arbitrary entry.
The scheme I decided upon was to index the entries (using CSharpTest.Net.BPlusTree) in one file containing each word and an offset into a second data file containing the definitions.
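As a rough illustration of that scheme, here is a minimal sketch of a word-to-offset index backed by CSharpTest.Net.BPlusTree. The index file name is a made-up placeholder, and the exact options API should be checked against the library's own documentation:

```csharp
using CSharpTest.Net.Collections;
using CSharpTest.Net.Serialization;

// Sketch only: a persistent B+Tree mapping each word to the byte offset
// of its serialised entry in the data file. "dict-en.idx" is hypothetical.
var options = new BPlusTree<string, long>.OptionsV2(
    PrimitiveSerializer.String, PrimitiveSerializer.Int64)
{
    CreateFile = CreatePolicy.IfNeeded,
    FileName = "dict-en.idx"
};

using (var index = new BPlusTree<string, long>(options))
{
    index["aardvark"] = 0;            // word -> byte offset in the data file
    long offset = index["aardvark"];  // later: look the offset back up
}
```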
So I wrote a function to create the dictionary file:
public void CreateCondensedDictionary(bool includePlurals = true)
{
    var dfile = new DFile();
    Int64 offset = 0;
    DictionaryEntry dictionaryEntry = null;

    foreach (var line in ReadLines(_datafileName))
    {
        var tokens = Tokenise(line);
        var entryKey = tokens.ElementAtOrDefault(1);
        if (!string.IsNullOrEmpty(entryKey))
        {
            if (dictionaryEntry != null &&
                String.Compare(entryKey, dictionaryEntry.EntryKey,
                    StringComparison.InvariantCultureIgnoreCase) == 0)
            {
                // Append definitions to entry
                dictionaryEntry.AddDefinition(tokens);
            }
            else
            {
                if (dictionaryEntry != null)
                {
                    dfile.Add(dictionaryEntry.EntryKey, offset);
                    offset = SerializeBinaryAddToIndex(dictionaryEntry, offset);
                }

                if (includePlurals)
                {
                    // Start new entry
                    dictionaryEntry = new DictionaryEntry { EntryKey = entryKey };
                    dictionaryEntry.AddDefinition(tokens);
                }
                else
                {
                    if (!IsPlural(tokens))
                    {
                        // Start new entry
                        dictionaryEntry = new DictionaryEntry { EntryKey = entryKey };
                        dictionaryEntry.AddDefinition(tokens);
                    }
                    else
                    {
                        dictionaryEntry = null;
                    }
                }
            }
        }
    }
}
Essentially, the code runs through the text dictionary file line by line and, for each entry, creates a DictionaryEntry object, then populates it with DictionaryDefinition objects.
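The ReadLines and Tokenise helpers used above aren't shown in the post, and the source file's exact layout isn't stated. Purely as a sketch, assuming one tab-separated definition per line (with the entry key, part of speech and definition text at token positions 1, 2 and 3, as the code above implies), they might look like:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helpers; the real source format may differ.
static IEnumerable<string> ReadLines(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }
}

static string[] Tokenise(string line)
{
    // Assumed: tab-separated fields, with the entry key at index 1,
    // part of speech at index 2 and definition text at index 3.
    return line.Split('\t');
}
```

(On .NET 4 and later, File.ReadLines does the lazy line enumeration for you.)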
Then my next step was to serialise the objects to a binary file, while recording the offset of the write into the file for indexing purposes.
The serialisation of the class was absolutely vanilla:
[Serializable]
public class DictionaryEntry
{
    public string EntryKey { get; set; }
    public List<DictionaryDefinition> Definitions { get; set; }

    public DictionaryEntry()
    {
        Definitions = new List<DictionaryDefinition>();
    }

    public void AddDefinition(IEnumerable<string> entryTokens)
    {
        var partOfSpeech = entryTokens.ElementAtOrDefault(2);
        var definitionText = entryTokens.ElementAtOrDefault(3);
        if (!string.IsNullOrEmpty(partOfSpeech) && !string.IsNullOrEmpty(definitionText))
        {
            Definitions.Add(new DictionaryDefinition(partOfSpeech, definitionText));
        }
    }
}
and the Serialization call:
using (var file = File.Open(@"e:\dict-en-comp.dat", FileMode.OpenOrCreate))
{
    file.Seek(offset, SeekOrigin.Begin);
    var bFormatter = new BinaryFormatter();
    bFormatter.Serialize(file, dictionaryEntry);
    file.Flush();
    offset = file.Position;
}
Setting up the index was easy, but when I serialised the data with this simple implementation, the file size jumped from 38MB of text data to over 300MB of serialised binary data. Horror. The reason for this was the presence of repeated type-metadata strings like this:
<EntryKey>k__BackingField.<Definitions>k__BackingField..}System.Collections.Generic.List`1[[DFile.DictionaryDefinition, DFile, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]
Interestingly, the resultant datafile zipped down to about 10MB, giving some indication of just how much redundancy there was in the file. So I looked for a leaner serialisation format, and found Google’s Protocol Buffers format, and in particular the protobuf-net implementation. From Wikipedia:
Protocol Buffers are a serialization format with an interface description language developed by Google. The original Google implementation for C++, Java and Python is available under a free software, open source license. Various other language implementations are either available or in development.
The design goals for Protocol Buffers emphasized simplicity and performance. In particular, it was designed to be smaller and faster than XML.
In order to use this my code hardly changed at all:
[ProtoContract]
public class DictionaryEntry
{
    [ProtoMember(1)]
    public string EntryKey;

    [ProtoMember(2)]
    public List<DictionaryDefinition> Definitions;

    // The rest of the class as is
}
and the serialization code:
using (var file = File.Open(@"e:\dict-en-comp.dat", FileMode.OpenOrCreate))
{
    file.Seek(offset, SeekOrigin.Begin);
    Serializer.SerializeWithLengthPrefix(file, dictionaryEntry, PrefixStyle.Base128);
    file.Flush();
    offset = file.Position;
}
Using Protocol Buffers resulted in a data file of 30MB (which I will reduce further, as most of the data is in string form). A tenfold reduction on the CLR's default BinaryFormatter.
Note: I used SerializeWithLengthPrefix so that I could serialise and deserialise single objects within the stream of messages; if the plain Serialize function is used, the whole stream is serialised/deserialised in one go. I had a little confusion because protobuf-net is, er, sparsely documented, but the author himself, Marc Gravell, responded to my query on Stack Overflow within minutes of my posing the question. Great library, great guy, thank you very much.
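Putting the two halves together, the read path then looks something like the sketch below: fetch the offset from the index, seek to it, and deserialise just that one length-prefixed entry. The LookupOffset helper is an assumption standing in for the B+Tree lookup, which isn't shown in the post:

```csharp
using System.IO;
using ProtoBuf;

// Sketch of the lookup path; LookupOffset is a hypothetical wrapper
// around the B+Tree index described earlier.
public DictionaryEntry Lookup(string word)
{
    long offset = LookupOffset(word);  // word -> byte offset, from the index
    using (var file = File.OpenRead(@"e:\dict-en-comp.dat"))
    {
        file.Seek(offset, SeekOrigin.Begin);
        // Reads exactly one length-prefixed message from the stream,
        // leaving the rest of the file untouched.
        return Serializer.DeserializeWithLengthPrefix<DictionaryEntry>(
            file, PrefixStyle.Base128);
    }
}
```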