Recipe 8.8 Compiling Regular Expressions
Problem
You have a handful of regular expressions
to execute as quickly as possible over many different strings.
Performance is of the utmost importance.
Solution
The best way to accomplish this task is to use compiled regular
expressions. This technique has some drawbacks, however, which are
examined in the Discussion section.
There are two ways to compile regular
expressions. The easiest way is to pass the
RegexOptions.Compiled enumeration value as the
options parameter of the static
Match or Matches methods on the
Regex class:
Match theMatch = Regex.Match(source, pattern, RegexOptions.Compiled);
MatchCollection theMatches = Regex.Matches(source, pattern, RegexOptions.Compiled);
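The same flag can also be passed to the Regex constructor, which is convenient when one expression runs over many strings. The following is a minimal sketch, not part of the original recipe; the class name CompiledRegexDemo, the CountNumbers method, and the \d+ pattern are hypothetical and chosen only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledRegexDemo
{
    // Hypothetical pattern for illustration: one or more digits.
    // The Regex object is built once and reused for every source string.
    private static readonly Regex Digits =
        new Regex(@"\d+", RegexOptions.Compiled);

    // Counts the runs of digits found across all of the source strings.
    public static int CountNumbers(string[] sources)
    {
        int total = 0;
        foreach (string source in sources)
        {
            total += Digits.Matches(source).Count;
        }
        return total;
    }

    public static void Main()
    {
        string[] sources = { "order 12, item 7", "no digits here", "2024-01-15" };
        Console.WriteLine(CountNumbers(sources)); // prints 5
    }
}
```

Holding the Regex object in a static readonly field keeps the compiled expression alive for the life of the application, so the compilation step is not repeated for each string.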
If more than a few expressions will be compiled and/or the
expressions need to be shared across applications, consider
precompiling all of these expressions into their own assembly.
Do this by using the
static CompileToAssembly method on the
Regex class. The following method accepts an
assembly name and compiles two simple regular expressions into this
assembly:
public static void CreateRegExDLL(string assmName)
{
    // Each entry: pattern, options, generated type name, namespace, isPublic.
    RegexCompilationInfo[] RE = new RegexCompilationInfo[2]
    {
        new RegexCompilationInfo("PATTERN", RegexOptions.Compiled,
                                 "CompiledPATTERN", "Chapter_Code", true),
        new RegexCompilationInfo("NAME", RegexOptions.Compiled,
                                 "CompiledNAME", "Chapter_Code", true)
    };

    // Name the output assembly (without the .dll extension).
    System.Reflection.AssemblyName aName =
        new System.Reflection.AssemblyName( );
    aName.Name = assmName;

    // Compile both expressions into the named assembly.
    Regex.CompileToAssembly(RE, aName);
}
Now that the expressions are compiled to an assembly, the assembly
can be added as a reference to your project and used as follows:
Chapter_Code.CompiledNAME CN = new Chapter_Code.CompiledNAME( );
Match mName = CN.Match("Get the NAME from this text.");
Console.WriteLine("mName.Value = " + mName.Value);
This code displays the following text:
mName.Value = NAME
Discussion
Compiling a regular expression allows it to run faster. To
understand how, we need to examine the process that an expression
goes through as it is run against a string. If an expression is not
compiled, the regular expression engine converts it to a
series of internal codes that only the engine itself recognizes;
the expression is not converted to MSIL. As the expression
runs against a string, the engine interprets this series of internal
codes. This can be a slow process, especially as the source string
grows larger and the expression grows more complex.
To fix this performance problem, you can compile the expression so
that it gets converted directly to a series of MSIL instructions,
which perform the pattern matching for the specific regular
expression. Once the Just-In-Time (JIT) compiler is run on this MSIL,
the instructions are converted to machine code. This allows for an
extremely fast execution of the pattern against a string.
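To see the difference between the interpreted and compiled forms on your own patterns, you can time both. The following is a rough sketch rather than a rigorous benchmark; the RegexTimingSketch class, the TimeMatches method, the pattern, and the input string are all hypothetical, and the measured times will vary by machine and runtime version:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTimingSketch
{
    // Runs the expression against the input repeatedly and returns
    // the elapsed time in milliseconds.
    public static long TimeMatches(Regex regex, string input, int iterations)
    {
        // Warm up first so one-time setup costs are not part of the timing.
        regex.IsMatch(input);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Hypothetical pattern: a capitalized word followed by a four-digit year.
        const string pattern = @"\b[A-Z][a-z]+\s\d{4}\b";
        string input = new string('x', 5000) + " March 2005";

        Regex interpreted = new Regex(pattern, RegexOptions.None);
        Regex compiled = new Regex(pattern, RegexOptions.Compiled);

        Console.WriteLine("Interpreted: " +
            TimeMatches(interpreted, input, 10000) + " ms");
        Console.WriteLine("Compiled:    " +
            TimeMatches(compiled, input, 10000) + " ms");
    }
}
```

Both Regex objects are built from the same pattern, so any difference in the reported times comes from how the expression is executed, not from what it matches.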
Using the
RegexOptions.Compiled enumerated value to compile
regular expressions has two drawbacks. The first is that an expression
performs very slowly the first time it is used with the
Compiled flag, because it must be compiled before it can run.
Fortunately, this is a one-time expense: the static Regex methods
cache the compiled expression, and a Regex object that you construct
and keep is compiled only once.
The second drawback is that an in-memory assembly is generated to
contain the IL. An assembly cannot be unloaded individually from an
AppDomain, so the garbage collector cannot remove it
from memory. The more expressions you compile, the more memory is
consumed and never released. So use this technique wisely.
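The first-use cost can be observed directly by timing the construction and first match of a compiled expression, and then timing a second match on the same object. This is only an illustrative sketch; the FirstUseCostSketch class, the Measure method, the pattern, and the input are hypothetical, and no particular timings are implied:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class FirstUseCostSketch
{
    // Returns two elapsed times in milliseconds:
    // [0] construction plus the first match, [1] a second match.
    public static double[] Measure(string pattern, string input)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Regex compiled = new Regex(pattern, RegexOptions.Compiled);
        compiled.IsMatch(input);
        watch.Stop();
        double firstMs = watch.Elapsed.TotalMilliseconds;

        // The expression is already compiled, so this call only runs the match.
        watch = Stopwatch.StartNew();
        compiled.IsMatch(input);
        watch.Stop();
        double secondMs = watch.Elapsed.TotalMilliseconds;

        return new double[] { firstMs, secondMs };
    }

    public static void Main()
    {
        // Hypothetical pattern and input, used only for illustration.
        double[] times = Measure(@"(\w+)@(\w+)\.com",
                                 "contact: someone@example.com");
        Console.WriteLine("Construction + first match: " + times[0] + " ms");
        Console.WriteLine("Second match:               " + times[1] + " ms");
    }
}
```

If the first figure is large relative to the number of matches your application will perform, the Compiled flag may cost more than it saves.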
Compiling regular expressions into their own assembly immediately
gives you two benefits. First, precompiled expressions do not require
any extra time to be compiled while your application is running.
Second, they are in their own assembly and therefore can be used by
other applications.
Consider precompiling regular expressions and placing them in their
own assembly rather than using the
RegexOptions.Compiled flag.
To compile one or more expressions into an assembly, use the static
CompileToAssembly method of the
Regex class. First, create a
RegexCompilationInfo array and fill it with
RegexCompilationInfo objects, one for each expression. Next,
describe the assembly in which the expressions will live by creating an
instance of the AssemblyName class with
its default constructor and giving it a name.
Do not include the .dll file extension in the
name; it is added automatically. Finally, call the
CompileToAssembly method, passing the
RegexCompilationInfo array and the
AssemblyName object as
arguments.
In our example, this assembly is placed in the same directory that
the executable was launched from.
See Also
See the ".NET Framework Regular
Expressions" and "AssemblyName
Class" topics in the MSDN documentation.