Provider Documentation
TinkerPop exposes a set of interfaces, protocols, and tests that make it possible for third-parties to build libraries and systems that plug-in to the TinkerPop stack. TinkerPop refers to those third-parties as "providers" and this documentation is designed to help providers understand what is involved in developing code on these lower levels of the TinkerPop API.
This document attempts to address the needs of the different providers that have been identified:
-
Graph System Provider
-
Graph Database Provider
-
Graph Processor Provider
-
-
Graph Driver Provider
-
Graph Language Provider
-
Graph Plugin Provider
Graph System Provider Requirements
At the core of TinkerPop3 is a Java8 API. The implementation of this
core API and its validation via the gremlin-test
suite is all that is required of a graph system provider wishing to
provide a TinkerPop3-enabled graph engine. Once a graph system has a valid implementation, then all the applications
provided by TinkerPop (e.g. Gremlin Console, Gremlin Server, etc.) and 3rd-party developers (e.g. Gremlin-Scala,
Gremlin-JS, etc.) will integrate properly. Finally, please feel free to use the logo on the left to promote your
TinkerPop3 implementation.
Implementing Gremlin-Core
The classes that a graph system provider should focus on implementing are itemized below. It is a good idea to study
the TinkerGraph (in-memory OLTP and OLAP
in tinkergraph-gremlin
), Neo4jGraph
(OTLP w/ transactions in neo4j-gremlin
) and/or
HadoopGraph (OLAP in hadoop-gremlin
)
implementations for ideas and patterns.
-
Online Transactional Processing Graph Systems (OLTP)
-
Structure API:
Graph
,Element
,Vertex
,Edge
,Property
andTransaction
(if transactions are supported). -
Process API:
TraversalStrategy
instances for optimizing Gremlin traversals to the provider’s graph system (i.e.TinkerGraphStepStrategy
).
-
-
Online Analytics Processing Graph Systems (OLAP)
-
Everything required of OTLP is required of OLAP (but not vice versa).
-
GraphComputer API:
GraphComputer
,Messenger
,Memory
.
-
Please consider the following implementation notes:
-
Be sure your
Graph
implementation is named asXXXGraph
(e.g. TinkerGraph, Neo4jGraph, HadoopGraph, etc.). -
Use
StringHelper
to ensuring that thetoString()
representation of classes are consistent with other implementations. -
Ensure that your implementation’s
Features
(Graph, Vertex, etc.) are correct so that test cases handle particulars accordingly. -
Use the numerous static method helper classes such as
ElementHelper
,GraphComputerHelper
,VertexProgramHelper
, etc. -
There are a number of default methods on the provided interfaces that are semantically correct. However, if they are not efficient for the implementation, override them.
-
Implement the
structure/
package interfaces first and then, if desired, interfaces in theprocess/
package interfaces. -
ComputerGraph
is aWrapper
system that ensure proper semantics during a GraphComputer computation.
OLTP Implementations
The most important interfaces to implement are in the structure/
package. These include interfaces like Graph, Vertex, Edge, Property, Transaction, etc. The StructureStandardSuite
will ensure that the semantics of the methods implemented are correct. Moreover, there are numerous Exceptions
classes with static exceptions that should be thrown by the graph system so that all the exceptions and their
messages are consistent amongst all TinkerPop3 implementations.
OLAP Implementations
Implementing the OLAP interfaces may be a bit more complicated. Note that before OLAP interfaces are implemented, it is necessary for the OLTP interfaces to be, at minimal, implemented as specified in OLTP Implementations. A summary of each required interface implementation is presented below:
-
GraphComputer
: A fluent builder for specifying an isolation level, a VertexProgram, and any number of MapReduce jobs to be submitted. -
Memory
: A global blackboard for ANDing, ORing, INCRing, and SETing values for specified keys. -
Messenger
: The system that collects and distributes messages being propagated by vertices executing the VertexProgram application. -
MapReduce.MapEmitter
: The system that collects key/value pairs being emitted by the MapReduce applications map-phase. -
MapReduce.ReduceEmitter
: The system that collects key/value pairs being emitted by the MapReduce applications combine- and reduce-phases.
Note
|
The VertexProgram and MapReduce interfaces in the process/computer/ package are not required by the graph
system. Instead, these are interfaces to be implemented by application developers writing VertexPrograms and MapReduce jobs.
|
Important
|
TinkerPop3 provides three OLAP implementations: TinkerGraphComputer (TinkerGraph), GiraphGraphComputer (HadoopGraph), and SparkGraphComputer (Hadoop). Given the complexity of the OLAP system, it is good to study and copy many of the patterns used in these reference implementations. |
Implementing GraphComputer
The most complex method in GraphComputer is the submit()
-method. The method must do the following:
-
Ensure the the GraphComputer has not already been executed.
-
Ensure that at least there is a VertexProgram or 1 MapReduce job.
-
If there is a VertexProgram, validate that it can execute on the GraphComputer given the respectively defined features.
-
Create the Memory to be used for the computation.
-
Execute the VertexProgram.setup() method once and only once.
-
Execute the VertexProgram.execute() method for each vertex.
-
Execute the VertexProgram.terminate() method once and if true, repeat VertexProgram.execute().
-
When VertexProgram.terminate() returns true, move to MapReduce job execution.
-
MapReduce jobs are not required to be executed in any specified order.
-
For each Vertex, execute MapReduce.map(). Then (if defined) execute MapReduce.combine() and MapReduce.reduce().
-
Update Memory with runtime information.
-
Construct a new
ComputerResult
containing the compute Graph and Memory.
Implementing Memory
The Memory object is initially defined by VertexProgram.setup()
.
The memory data is available in the first round of the VertexProgram.execute()
method. Each Vertex, when executing
the VertexProgram, can update the Memory in its round. However, the update is not seen by the other vertices until
the next round. At the end of the first round, all the updates are aggregated and the new memory data is available
on the second round. This process repeats until the VertexProgram terminates.
Implementing Messenger
The Messenger object is similar to the Memory object in that a vertex can read and write to the Messenger. However, the data it reads are the messages sent to the vertex in the previous step and the data it writes are the messages that will be readable by the receiving vertices in the subsequent round.
Implementing MapReduce Emitters
The MapReduce framework in TinkerPop3 is similar to the model
popularized by Hadoop. The primary difference is that all Mappers process the vertices
of the graph, not an arbitrary key/value pair. However, the vertices' edges can not be accessed — only their
properties. This greatly reduces the amount of data needed to be pushed through the MapReduce engine as any edge
information required, can be computed in the VertexProgram.execute() method. Moreover, at this stage, vertices can
not be mutated, only their token and property data read. A Gremlin OLAP system needs to provide implementations for
to particular classes: MapReduce.MapEmitter
and MapReduce.ReduceEmitter
. TinkerGraph’s implementation is provided
below which demonstrates the simplicity of the algorithm (especially when the data is all within the same JVM).
public class TinkerMapEmitter<K, V> implements MapReduce.MapEmitter<K, V> {
public Map<K, Queue<V>> reduceMap;
public Queue<KeyValue<K, V>> mapQueue;
private final boolean doReduce;
public TinkerMapEmitter(final boolean doReduce) { (1)
this.doReduce = doReduce;
if (this.doReduce)
this.reduceMap = new ConcurrentHashMap<>();
else
this.mapQueue = new ConcurrentLinkedQueue<>();
}
@Override
public void emit(K key, V value) {
if (this.doReduce)
this.reduceMap.computeIfAbsent(key, k -> new ConcurrentLinkedQueue<>()).add(value); (2)
else
this.mapQueue.add(new KeyValue<>(key, value)); (3)
}
protected void complete(final MapReduce<K, V, ?, ?, ?> mapReduce) {
if (!this.doReduce && mapReduce.getMapKeySort().isPresent()) { (4)
final Comparator<K> comparator = mapReduce.getMapKeySort().get();
final List<KeyValue<K, V>> list = new ArrayList<>(this.mapQueue);
Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator));
this.mapQueue.clear();
this.mapQueue.addAll(list);
} else if (mapReduce.getMapKeySort().isPresent()) {
final Comparator<K> comparator = mapReduce.getMapKeySort().get();
final List<Map.Entry<K, Queue<V>>> list = new ArrayList<>();
list.addAll(this.reduceMap.entrySet());
Collections.sort(list, Comparator.comparing(Map.Entry::getKey, comparator));
this.reduceMap = new LinkedHashMap<>();
list.forEach(entry -> this.reduceMap.put(entry.getKey(), entry.getValue()));
}
}
}
-
If the MapReduce job has a reduce, then use one data structure (
reduceMap
), else use another (mapList
). The difference being that a reduction requires a grouping by key and therefore, theMap<K,Queue<V>>
definition. If no reduction/grouping is required, then a simpleQueue<KeyValue<K,V>>
can be leveraged. -
If reduce is to follow, then increment the Map with a new value for the key.
MapHelper
is a TinkerPop3 class with static methods for adding data to a Map. -
If no reduce is to follow, then simply append a KeyValue to the queue.
-
When the map phase is complete, any map-result sorting required can be executed at this point.
public class TinkerReduceEmitter<OK, OV> implements MapReduce.ReduceEmitter<OK, OV> {
protected Queue<KeyValue<OK, OV>> reduceQueue = new ConcurrentLinkedQueue<>();
@Override
public void emit(final OK key, final OV value) {
this.reduceQueue.add(new KeyValue<>(key, value));
}
protected void complete(final MapReduce<?, ?, OK, OV, ?> mapReduce) {
if (mapReduce.getReduceKeySort().isPresent()) {
final Comparator<OK> comparator = mapReduce.getReduceKeySort().get();
final List<KeyValue<OK, OV>> list = new ArrayList<>(this.reduceQueue);
Collections.sort(list, Comparator.comparing(KeyValue::getKey, comparator));
this.reduceQueue.clear();
this.reduceQueue.addAll(list);
}
}
}
The method MapReduce.reduce()
is defined as:
public void reduce(final OK key, final Iterator<OV> values, final ReduceEmitter<OK, OV> emitter) { ... }
In other words, for the TinkerGraph implementation, iterate through the entrySet of the reduceMap
and call the
reduce()
method on each entry. The reduce()
method can emit key/value pairs which are simply aggregated into a
Queue<KeyValue<OK,OV>>
in an analogous fashion to TinkerMapEmitter
when no reduce is to follow. These two emitters
are tied together in TinkerGraphComputer.submit()
.
...
for (final MapReduce mapReduce : mapReducers) {
if (mapReduce.doStage(MapReduce.Stage.MAP)) {
final TinkerMapEmitter<?, ?> mapEmitter = new TinkerMapEmitter<>(mapReduce.doStage(MapReduce.Stage.REDUCE));
final SynchronizedIterator<Vertex> vertices = new SynchronizedIterator<>(this.graph.vertices());
workers.setMapReduce(mapReduce);
workers.mapReduceWorkerStart(MapReduce.Stage.MAP);
workers.executeMapReduce(workerMapReduce -> {
while (true) {
final Vertex vertex = vertices.next();
if (null == vertex) return;
workerMapReduce.map(ComputerGraph.mapReduce(vertex), mapEmitter);
}
});
workers.mapReduceWorkerEnd(MapReduce.Stage.MAP);
// sort results if a map output sort is defined
mapEmitter.complete(mapReduce);
// no need to run combiners as this is single machine
if (mapReduce.doStage(MapReduce.Stage.REDUCE)) {
final TinkerReduceEmitter<?, ?> reduceEmitter = new TinkerReduceEmitter<>();
final SynchronizedIterator<Map.Entry<?, Queue<?>>> keyValues = new SynchronizedIterator((Iterator) mapEmitter.reduceMap.entrySet().iterator());
workers.mapReduceWorkerStart(MapReduce.Stage.REDUCE);
workers.executeMapReduce(workerMapReduce -> {
while (true) {
final Map.Entry<?, Queue<?>> entry = keyValues.next();
if (null == entry) return;
workerMapReduce.reduce(entry.getKey(), entry.getValue().iterator(), reduceEmitter);
}
});
workers.mapReduceWorkerEnd(MapReduce.Stage.REDUCE);
reduceEmitter.complete(mapReduce); // sort results if a reduce output sort is defined
mapReduce.addResultToMemory(this.memory, reduceEmitter.reduceQueue.iterator()); (1)
} else {
mapReduce.addResultToMemory(this.memory, mapEmitter.mapQueue.iterator()); (2)
}
}
}
...
-
Note that the final results of the reducer are provided to the Memory as specified by the application developer’s
MapReduce.addResultToMemory()
implementation. -
If there is no reduce stage, the the map-stage results are inserted into Memory as specified by the application developer’s
MapReduce.addResultToMemory()
implementation.
IO Implementations
If a Graph
requires custom serializers for IO to work properly, implement the Graph.io
method. A typical example
of where a Graph
would require such a custom serializers is if their identifier system uses non-primitive values,
such as OrientDB’s Rid
class. From basic serialization of a single Vertex
all the way up the stack to Gremlin
Server, the need to know how to handle these complex identifiers is an important requirement.
The first step to implementing custom serializers is to first implement the IoRegistry
interface and register the
custom classes and serializers to it. Each Io
implementation has different requirements for what it expects from the
IoRegistry
:
-
GraphML - No custom serializers expected/allowed.
-
GraphSON - Register a Jackson
SimpleModule
. TheSimpleModule
encapsulates specific classes to be serialized, so it does not need to be registered to a specific class in theIoRegistry
(usenull
). -
Gryo - Expects registration of one of three objects:
-
Register just the custom class with a
null
KryoSerializer
implementation - this class will use default "field-level" Kryo serialization. -
Register the custom class with a specific Kryo ‘Serializer’ implementation.
-
Register the custom class with a
Function<Kryo, Serializer>
for those cases where the KryoSerializer
requires theKryo
instance to get constructed.
-
This implementation should provide a zero-arg constructor as the stack may require instantiation via reflection.
Consider extending AbstractIoRegistry
for convenience as follows:
public class MyGraphIoRegistry extends AbstractIoRegistry {
public MyGraphIoRegistry() {
register(GraphSONIo.class, null, new MyGraphSimpleModule());
register(GryoIo.class, MyGraphIdClass.class, new MyGraphIdSerializer());
}
}
In the Graph.io
method, provide the IoRegistry
object to the supplied Builder
and call the create
method to
return that Io
instance as follows:
public <I extends Io> I io(final Io.Builder<I> builder) {
return (I) builder.graph(this).registry(myGraphIoRegistry).create();
}}
In this way, Graph
implementations can pre-configure custom serializers for IO interactions and users will not need
to know about those details. Following this pattern will ensure proper execution of the test suite as well as
simplified usage for end-users.
Important
|
Proper implementation of IO is critical to successful Graph operations in Gremlin Server. The Test Suite
does have "serialization" tests that provide some assurance that an implementation is working properly, but those
tests cannot make assertions against any specifics of a custom serializer. It is the responsibility of the
implementer to test the specifics of their custom serializers.
|
Tip
|
Consider separating serializer code into its own module, if possible, so that clients that use the Graph
implementation remotely don’t need a full dependency on the entire Graph - just the IO components and related
classes being serialized.
|
Validating with Gremlin-Test
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-test</artifactId>
<version>3.1.2-incubating</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-groovy-test</artifactId>
<version>3.1.2-incubating</version>
</dependency>
The operational semantics of any OLTP or OLAP implementation are validated by gremlin-test
and functional
interoperability with the Groovy environment is ensured by gremlin-groovy-test
. To implement these tests, provide
test case implementations as shown below, where XXX
below denotes the name of the graph implementation (e.g.
TinkerGraph, Neo4jGraph, HadoopGraph, etc.).
// Structure API tests
@RunWith(StructureStandardSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXStructureStandardTest {}
// Process API tests
@RunWith(ProcessComputerSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXProcessComputerTest {}
@RunWith(ProcessStandardSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = XXXGraph.class)
public class XXXProcessStandardTest {}
@RunWith(GroovyEnvironmentSuite.class)
@GraphProviderClass(provider = XXXProvider.class, graph = TinkerGraph.class)
public class XXXGroovyEnvironmentTest {}
@RunWith(GroovyProcessStandardSuite.class)
@GraphProviderClass(provider = XXXGraphProvider.class, graph = TinkerGraph.class)
public class XXXGroovyProcessStandardTest {}
@RunWith(GroovyProcessComputerSuite.class)
@GraphProviderClass(provider = XXXGraphComputerProvider.class, graph = TinkerGraph.class)
public class XXXGroovyProcessComputerTest {}
The above set of tests represent the minimum test suite set to implement. There are other "integration" and "performance" tests that should be considered optional. Implementing those tests requires the same pattern as shown above.
Important
|
It is as important to look at "ignored" tests as it is to look at ones that fail. The gremlin-test
suite utilizes the Feature implementation exposed by the Graph to determine which tests to execute. If a test
utilizes features that are not supported by the graph, it will ignore them. While that may be fine, implementers
should validate that the ignored tests are appropriately bypassed and that there are no mistakes in their feature
definitions. Moreover, implementers should consider filling gaps in their own test suites, especially when
IO-related tests are being ignored.
|
The only test-class that requires any code investment is the GraphProvider
implementation class. This class is a
used by the test suite to construct Graph
configurations and instances and provides information about the
implementation itself. In most cases, it is best to simply extend AbstractGraphProvider
as it provides many
default implementations of the GraphProvider
interface.
Finally, specify the test suites that will be supported by the Graph
implementation using the @Graph.OptIn
annotation. See the TinkerGraph
implementation below as an example:
@Graph.OptIn(Graph.OptIn.SUITE_STRUCTURE_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER)
@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_PROCESS_COMPUTER)
@Graph.OptIn(Graph.OptIn.SUITE_GROOVY_ENVIRONMENT)
public class TinkerGraph implements Graph {
Only include annotations for the suites the implementation will support. Note that implementing the suite, but not specifying the appropriate annotation will prevent the suite from running (an obvious error message will appear in this case when running the mis-configured suite).
There are times when there may be a specific test in the suite that the implementation cannot support (despite the
features it implements) or should not otherwise be executed. It is possible for implementers to "opt-out" of a test
by using the @Graph.OptOut
annotation. The following is an example of this annotation usage as taken from
HadoopGraph
:
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_STANDARD)
@Graph.OptIn(Graph.OptIn.SUITE_PROCESS_COMPUTER)
@Graph.OptOut(
test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals",
method = "g_V_matchXa_hasXname_GarciaX__a_inXwrittenByX_b__a_inXsungByX_bX",
reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.")
@Graph.OptOut(
test = "org.apache.tinkerpop.gremlin.process.graph.step.map.MatchTest$Traversals",
method = "g_V_matchXa_inXsungByX_b__a_inXsungByX_c__b_outXwrittenByX_d__c_outXwrittenByX_e__d_hasXname_George_HarisonX__e_hasXname_Bob_MarleyXX",
reason = "Hadoop-Gremlin is OLAP-oriented and for OLTP operations, linear-scan joins are required. This particular tests takes many minutes to execute.")
@Graph.OptOut(
test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest",
method = "shouldNotAllowBadMemoryKeys",
reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.")
@Graph.OptOut(
test = "org.apache.tinkerpop.gremlin.process.computer.GraphComputerTest",
method = "shouldRequireRegisteringMemoryKeys",
reason = "Hadoop does a hard kill on failure and stops threads which stops test cases. Exception handling semantics are correct though.")
public class HadoopGraph implements Graph {
The above examples show how to ignore individual tests. It is also possible to:
-
Ignore an entire test case (i.e. all the methods within the test) by setting the
method
to "*". -
Ignore a "base" test class such that test that extend from those classes will all be ignored. This style of ignoring is useful for Gremlin "process" tests that have bases classes that are extended by various Gremlin flavors (e.g. groovy).
-
Ignore a
GraphComputer
test based on the type ofGraphComputer
being used. Specify the "computer" attribute on theOptOut
(which is an array specification) which should have a value of theGraphComputer
implementation class that should ignore that test. This attribute should be left empty for "standard" execution and by default allGraphComputer
implementations will be included in theOptOut
so if there are multiple implementations, explicitly specify the ones that should be excluded.
Also note that some of the tests in the Gremlin Test Suite are parameterized tests and require an additional level of specificity to be properly ignored. To ignore these types of tests, examine the name template of the parameterized tests. It is defined by a Java annotation that looks like this:
@Parameterized.Parameters(name = "expect({0})")
The annotation above shows that the name of each parameterized test will be prefixed with "expect" and have
parentheses wrapped around the first parameter (at index 0) value supplied to each test. This information can
only be garnered by studying the test set up itself. Once the pattern is determined and the specific unique name of
the parameterized test is identified, add it to the specific
property on the OptOut
annotation in addition to
the other arguments.
These annotations help provide users a level of transparency into test suite compliance (via the
describeGraph() utility function). It also
allows implementers to have a lot of flexibility in terms of how they wish to support TinkerPop. For example, maybe
there is a single test case that prevents an implementer from claiming support of a Feature
. The implementer could
choose to either not support the Feature
or to support it but "opt-out" of the test with a "reason" as to why so
that users understand the limitation.
Important
|
Before using OptOut be sure that the reason for using it is sound and it is more of a last resort.
It is possible that a test from the suite doesn’t properly represent the expectations of a feature, is too broad or
narrow for the semantics it is trying to enforce or simply contains a bug. Please consider raising issues in the
developer mailing list with such concerns before assuming OptOut is the only answer.
|
Important
|
There are no tests that specifically validate complete compliance with Gremlin Server. Generally speaking,
a Graph that passes the full Test Suite, should be compliant with Gremlin Server. The one area where problems can
occur is in serialization. Always ensure that IO is properly implemented, that custom serializers are tested fully
and ultimately integration test the Graph with an actual Gremlin Server instance.
|
Warning
|
Configuring tests to run in parallel might result in errors that are difficult to debug as there is some
shared state in test execution around graph configuration. It is therefore recommended that parallelism be turned
off for the test suite (the Maven SureFire Plugin is configured this way by default). It may also be important to
include this setting, <reuseForks>false</reuseForks> , in the SureFire configuration if tests are failing in an
unexplainable way.
|
Accessibility via GremlinPlugin
The applications distributed with TinkerPop3 do not distribute with
any graph system implementations besides TinkerGraph. If your implementation is stored in a Maven repository (e.g.
Maven Central Repository), then it is best to provide a GremlinPlugin
implementation so the respective jars can be
downloaded according and when required by the user. Neo4j’s GremlinPlugin is provided below for reference.
public class Neo4jGremlinPlugin implements GremlinPlugin {
private static final String IMPORT = "import ";
private static final String DOT_STAR = ".*";
private static final Set<String> IMPORTS = new HashSet<String>() {{
add(IMPORT + Neo4jGraph.class.getPackage().getName() + DOT_STAR);
}};
@Override
public String getName() {
return "neo4j";
}
@Override
public void pluginTo(final PluginAcceptor pluginAcceptor) {
pluginAcceptor.addImports(IMPORTS);
}
}
With the above plugin implementations, users can now download respective binaries for Gremlin Console, Gremlin Server, etc.
gremlin> g = Neo4jGraph.open('/tmp/neo4j')
No such property: Neo4jGraph for class: groovysh_evaluate
Display stack trace? [yN]
gremlin> :install org.apache.tinkerpop neo4j-gremlin 3.1.2-incubating
==>loaded: [org.apache.tinkerpop, neo4j-gremlin, …]
gremlin> :plugin use tinkerpop.neo4j
==>tinkerpop.neo4j activated
gremlin> g = Neo4jGraph.open('/tmp/neo4j')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/neo4j]]
In-Depth Implementations
The graph system implementation details presented thus far are minimum requirements necessary to yield a valid TinkerPop3 implementation. However, there are other areas that a graph system provider can tweak to provide an implementation more optimized for their underlying graph engine. Typical areas of focus include:
-
Traversal Strategies: A TraversalStrategy can be used to alter a traversal prior to its execution. A typical example is converting a pattern of
g.V().has('name','marko')
into a global index lookup for all vertices with name "marko". In this way, aO(|V|)
lookup becomes anO(log(|V|))
. Please reviewTinkerGraphStepStrategy
for ideas. -
Step Implementations: Every step is ultimately referenced by the
GraphTraversal
interface. It is possible to extendGraphTraversal
to use a graph system specific step implementation.
Graph Driver Provider Requirements
One of the roles for Gremlin Server is to provide a bridge from TinkerPop to non-JVM languages (e.g. Go, Python, etc.). Developers can build language bindings (or driver) that provide a way to submit Gremlin scripts to Gremlin Server and get back results. Given the extensible nature of Gremlin Server, it is difficult to provide an authoritative guide to developing a driver. It is however possible to describe the core communication protocol using the standard out-of-the-box configuration which should provide enough information to develop a driver for a specific language.
Gremlin Server is distributed with a configuration that utilizes WebSockets with a custom sub-protocol. Under this configuration, Gremlin Server accepts requests containing a Gremlin script, evaluates that script and then streams back the results. The notion of "streaming" is depicted in the diagram to the right.
The diagram shows an incoming request to process the Gremlin script of g.V()
. Gremlin Server evaluates that script,
getting an Iterator
of vertices as a result, and steps through each Vertex
within it. The vertices are batched
together given the resultIterationBatchSize
configuration. In this case, that value must be 2
given that each
"response" contains two vertices. Each response is serialized given the requested serializer type (JSON is likely
best for non-JVM languages) and written back to the requesting client immediately. Gremlin Server does not wait for
the entire result to be iterated, before sending back a response. It will send the responses as they are realized.
This approach allows for the processing of large result sets without having to serialize the entire result into memory for the response. It places a bit of a burden on the developer of the driver however, because it becomes necessary to provide a way to reconstruct the entire result on the client side from all of the individual responses that Gremlin Server returns for a single request. Again, this description of Gremlin Server’s "flow" is related to the out-of-the-box configuration. It is quite possible to construct other flows, that might be more amenable to a particular language or style of processing.
To formulate a request to Gremlin Server, a RequestMessage
needs to be constructed. The RequestMessage
is a
generalized representation of a request that carries a set of "standard" values in addition to optional ones that are
dependent on the operation being performed. A RequestMessage
has these fields:
Key | Description |
---|---|
requestId |
A UUID representing the unique identification for the request. |
op |
The name of the "operation" to execute based on the available |
processor |
The name of the |
args |
A |
This message can be serialized in any fashion that is supported by Gremlin Server. New serialization methods can
be plugged in by implementing a ServiceLoader
enabled MessageSerializer
, however Gremlin Server provides for
JSON serialization by default which will be good enough for purposes of most developers building drivers.
A RequestMessage
to evaluate a script with variable bindings looks like this in JSON:
{ "requestId":"1d6d02bd-8e56-421d-9438-3bd6d0079ff1",
"op":"eval",
"processor":"",
"args":{"gremlin":"g.traversal().V(x).out()",
"bindings":{"x":1},
"language":"gremlin-groovy"}}
The above JSON represents the "body" of the request to send to Gremlin Server. When sending this "body" over websockets Gremlin Server can accept a packet frame using a "text" (1) or a "binary" (2) opcode. Using "text" is a bit more limited in that Gremlin Server will always process the body of that request as JSON. Generally speaking "text" is just for testing purposes.
The preferred method for sending requests to Gremlin Server is to use the "binary" opcode. In this case, a "header"
will need be sent in addition to to the "body". The "header" basically consists of a "mime type" so that Gremlin
Server knows how to deserialize the RequestMessage
. So, the actual byte array sent to Gremlin Server would be
formatted as follows:
The first byte represents the length of the "mime type" string value that follows. Given the default configuration of
Gremlin Server, this value should be set to application/json
. The "payload" represents the JSON message above
encoded as bytes.
Note
|
Gremlin Server will only accept masked packets as it pertains to websocket packet header construction. |
When Gremlin Server receives that request, it will decode it given the "mime type", pass it to the requested
OpProcessor
which will execute the op
defined in the message. In this case, it will evaluate the script
g.traversal().V(x).out()
using the bindings
supplied in the args
and stream back the results in a series of
ResponseMessages
. A ResponseMessage
looks like this:
Key | Description |
---|---|
requestId |
The identifier of the |
status |
The |
result |
The |
In this case the ResponseMessage
returned to the client would look something like this:
{"result":{"data":[{"id": 2,"label": "person","type": "vertex","properties": [
{"id": 2, "value": "vadas", "label": "name"},
{"id": 3, "value": 27, "label": "age"}]},
], "meta":{}},
"requestId":"1d6d02bd-8e56-421d-9438-3bd6d0079ff1",
"status":{"code":206,"attributes":{},"message":""}}
Gremlin Server is capable of streaming results such that additional responses will arrive over the websocket until
the iteration of the result on the server is complete. Each successful incremental message will have a ResultCode
of 206
. Termination of the stream will be marked by a final 200
status code. Note that all messages without a
206
represent terminating conditions for a request. The following table details the various status codes that
Gremlin Server will send:
Code | Name | Description |
---|---|---|
200 |
SUCCESS |
The server successfully processed a request to completion - there are no messages remaining in this stream. |
204 |
NO CONTENT |
The server processed the request but there is no result to return (e.g. an |
206 |
PARTIAL CONTENT |
The server successfully returned some content, but there is more in the stream to arrive - wait for a |
401 |
UNAUTHORIZED |
The request attempted to access resources that the requesting user did not have access to. |
407 |
AUTHENTICATE |
A challenge from the server for the client to authenticate its request. |
498 |
MALFORMED REQUEST |
The request message was not properly formatted which means it could not be parsed at all or the "op" code was not recognized such that Gremlin Server could properly route it for processing. Check the message format and retry the request. |
499 |
INVALID REQUEST ARGUMENTS |
The request message was parseable, but the arguments supplied in the message were in conflict or incomplete. Check the message format and retry the request. |
500 |
SERVER ERROR |
A general server error occurred that prevented the request from being processed. |
597 |
SCRIPT EVALUATION ERROR |
The script submitted for processing evaluated in the |
598 |
SERVER TIMEOUT |
The server exceeded one of the timeout settings for the request and could therefore only partially responded or did not respond at all. |
599 |
SERVER SERIALIZATION ERROR |
The server was not capable of serializing an object that was returned from the script supplied on the request. Either transform the object into something Gremlin Server can process within the script or install mapper serialization classes to Gremlin Server. |
OpProcessors Arguments
The following sections define a non-exhaustive list of available operations and arguments for embedded OpProcessors
(i.e. ones packaged with Gremlin Server).
Common
All OpProcessor
instances support these arguments.
Key | Type | Description |
---|---|---|
batchSize |
Int |
When the result is an iterator this value defines the number of iterations each |
Standard OpProcessor
The "standard" OpProcessor
handles requests for the primary function of Gremlin Server - executing Gremlin.
Requests made to this OpProcessor
are "sessionless" in the sense that a request must encapsulate the entirety
of a transaction. There is no state maintained between requests. A transaction is started when the script is first
evaluated and is committed when the script completes (or rolled back if an error occurred).
Key | Description | ||||||
---|---|---|---|---|---|---|---|
processor |
As this is the default |
||||||
op |
|
authentication
operation arguments
Key | Type | Description |
---|---|---|
sasl |
byte[] |
Required The response to the server authentication challenge. This value is dependent on the SASL authentication mechanism required by the server. |
eval
operation arguments
Key | Type | Description |
---|---|---|
gremlin |
String |
Required The Gremlin script to evaluate. |
bindings |
Map |
A map of key/value pairs to apply as variables in the context of the Gremlin script. |
language |
String |
The flavor of Gremlin used (e.g. |
aliases |
Map |
A map of key/value pairs that allow globally bound |
scriptEvalTimeout |
Long |
An override for the server setting that determines the maximum time to wait for a script to execute on the server. |
Session OpProcessor
The "session" OpProcessor
handles requests for the primary function of Gremlin Server - executing Gremlin. It is
like the "standard" OpProcessor
, but instead maintains state between sessions and allows the option to leave all
transaction management up to the calling client. It is important that clients that open sessions, commit or roll
them back, however Gremlin Server will try to clean up such things when a session is killed that has been abandoned.
It is important to consider that a session can only be maintained with a single machine. In the event that multiple
Gremlin Server are deployed, session state is not shared among them.
Key | Description | ||||||||
---|---|---|---|---|---|---|---|---|---|
processor |
This value should be set to |
||||||||
op |
|
authentication
operation arguments
Key | Type | Description |
---|---|---|
sasl |
byte[] |
Required The response to the server authentication challenge. This value is dependent on the SASL authentication mechanism required by the server. |
eval
operation arguments
Key | Type | Description |
---|---|---|
gremlin |
String |
Required The Gremlin script to evaluate. |
session |
String |
Required The session identifier for the current session - typically this value should be a UUID (the session will be created if it doesn’t exist). |
manageTransaction |
Boolean |
When set to |
bindings |
Map |
A map of key/value pairs to apply as variables in the context of the Gremlin script. |
scriptEvalTimeout |
Long |
An override for the server setting that determines the maximum time to wait for a script to execute on the server. |
language |
String |
The flavor of Gremlin used (e.g. |
aliases |
Map |
A map of key/value pairs that allow globally bound |
close
operation arguments
Key | Type | Description |
---|---|---|
session |
String |
Required The session identifier for the session to close. |
Authentication
Gremlin Server supports SASL-based
authentication. A SASL implementation provides a series of challenges and responses that a driver must comply with
in order to authenticate. By default, Gremlin Server only supports the "PLAIN" SASL mechanism, which is a cleartext
password system. When authentication is enabled, an incoming request is intercepted before it is evaluated by the
ScriptEngine
. The request is saved on the server and a AUTHENTICATE
challenge response (status code 407
) is
returned to the client.
The client will detect the AUTHENTICATE
and respond with an authentication
for the op
and an arg
named sasl
that contains the password. The password should be either, an encoded sequence of UTF-8 bytes, delimited by 0
(US-ASCII NUL), where the form is : <NUL>username<NUL>password
, or a Base64 encoded string of the former (which
in this instance would be AHVzZXJuYW1lAHBhc3N3b3Jk
). Should Gremlin Server be able to authenticate with the
provided credentials, the server will return the results of the original request as it normally does without
authentication. If it cannot authenticate given the challenge response from the client, it will return UNAUTHORIZED
(status code 401
).
Note
|
Gremlin Server does not support the "authorization identity" as described in RFC4616. |
Gremlin Plugins
Plugins provide a way to expand the features of Gremlin Console and Gremlin Server. The first step to developing a
plugin is to implement the GremlinPlugin
interface:
package org.apache.tinkerpop.gremlin.groovy.plugin;
import java.util.Optional;
/**
* Those wanting to extend Gremlin can implement this interface to provide mapper imports and extension
* methods to the language itself. Gremlin uses {@code ServiceLoader} to install plugins. It is necessary for
* projects to include a {@code org.apache.tinkerpop.gremlin.groovy.plugin.GremlinPlugin} file in
* {@code META-INF/services} of their packaged project which includes the full class names of the implementations
* of this interface to install.
*
* @author Stephen Mallette (http://stephen.genoprime.com)
*/
public interface GremlinPlugin {
public static final String ENVIRONMENT = "GremlinPlugin.env";
/**
* The name of the plugin. This name should be unique (use a namespaced approach) as naming clashes will
* prevent proper plugin operations. Plugins developed by TinkerPop will be prefixed with "tinkerpop."
* For example, TinkerPop's implementation of Giraph would be named "tinkerpop.giraph". If Facebook were
* to do their own implementation the implementation might be called "facebook.giraph".
*/
public String getName();
/**
* Implementers will typically execute imports of classes within their project that they want available in the
* console or they may use meta programming to introduce new extensions to the Gremlin.
*
* @throws IllegalEnvironmentException if there are missing environment properties required by the plugin as
* provided from {@link PluginAcceptor#environment()}.
* @throws PluginInitializationException if there is a failure in the plugin iniitalization process
*/
public void pluginTo(final PluginAcceptor pluginAcceptor) throws IllegalEnvironmentException, PluginInitializationException;
/**
* Some plugins may require a restart of the plugin host for the classloader to pick up the features. This is
* typically true of plugins that rely on {@code Class.forName()} to dynamically instantiate classes from the
* root classloader (e.g. JDBC drivers that instantiate via @{code DriverManager}).
*/
public default boolean requireRestart() {
return false;
}
/**
* Allows a plugin to utilize features of the {@code :remote} and {@code :submit} commands of the Gremlin Console.
* This method does not need to be implemented if the plugin is not meant for the Console for some reason or
* if it does not intend to take advantage of those commands.
*/
public default Optional<RemoteAcceptor> remoteAcceptor() {
return Optional.empty();
}
}
The most simple plugin and the one most commonly implemented will likely be one that just provides a list of classes to import to the Gremlin Console. This type of plugin is the easiest way for implementers of the TinkerPop Structure and Process APIs to make their implementations available to users. The TinkerGraph implementation has just such a plugin:
package org.apache.tinkerpop.gremlin.tinkergraph.groovy.plugin;
import org.apache.tinkerpop.gremlin.groovy.plugin.AbstractGremlinPlugin;
import org.apache.tinkerpop.gremlin.groovy.plugin.IllegalEnvironmentException;
import org.apache.tinkerpop.gremlin.groovy.plugin.PluginAcceptor;
import org.apache.tinkerpop.gremlin.groovy.plugin.PluginInitializationException;
import org.apache.tinkerpop.gremlin.tinkergraph.process.computer.TinkerGraphComputer;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
import java.util.HashSet;
import java.util.Set;
/**
* @author Stephen Mallette (http://stephen.genoprime.com)
*/
public final class TinkerGraphGremlinPlugin extends AbstractGremlinPlugin {
private static final Set<String> IMPORTS = new HashSet<String>() {{
add(IMPORT_SPACE + TinkerGraph.class.getPackage().getName() + DOT_STAR);
add(IMPORT_SPACE + TinkerGraphComputer.class.getPackage().getName() + DOT_STAR);
}};
@Override
public String getName() {
return "tinkerpop.tinkergraph";
}
@Override
public void pluginTo(final PluginAcceptor pluginAcceptor) throws PluginInitializationException, IllegalEnvironmentException {
pluginAcceptor.addImports(IMPORTS);
}
@Override
public void afterPluginTo(final PluginAcceptor pluginAcceptor) throws IllegalEnvironmentException, PluginInitializationException {
}
}
Note that the plugin provides a unique name for the plugin which follows a namespaced pattern as namespace.plugin-name
(e.g. "tinkerpop.hadoop" - "tinkerpop" is the reserved namespace for TinkerPop maintained plugins). To make TinkerGraph
classes available to the Console, the PluginAcceptor
is given a Set
of imports to provide to the plugin host. The
PluginAcceptor
essentially behaves as an abstraction to the "host" that is handling the GremlinPlugin
. GremlinPlugin
implementations maybe hosted by the Console as well as the ScriptEngine
in Gremlin Server. Obviously, registering
new commands and other operations that are specific to the Groovy Shell don’t make sense there. Write the code for
the plugin defensively by checking the GremlinPlugin.env
key in the PluginAcceptor.environment()
to understand
which environment the plugin is being used in.
There is one other step to follow to ensure that the GremlinPlugin
is visible to its hosts. GremlinPlugin
implementations are loaded via ServiceLoader
and therefore need a resource file added to the jar file where the plugin exists. Add a file called
org.apache.tinkerpop.gremlin.groovy.plugin.GremlinPlugin
to META-INF.services
. In the case of the TinkerGraph
plugin above, that file will have this line in it:
org.apache.tinkerpop.gremlin.tinkergraph.groovy.plugin.TinkerGraphGremlinPlugin
Once the plugin is packaged, there are two ways to test it out:
-
Copy the jar and its dependencies to the Gremlin Console path and start it.
-
Start Gremlin Console and try the
:install
command::install com.company my-plugin 1.0.0
.
In either case, once one of these two approaches is taken, the jars and their dependencies are available to the
Console. The next step is to "activate" the plugin by doing :plugin use my-plugin
, where "my-plugin" refers to the
name of the plugin to activate.
Note
|
When :install is used logging dependencies related to SLF4J are filtered out so as
not to introduce multiple logger bindings (which generates warning messages to the logs).
|
A plugin can do much more than just import classes. One can expand the Gremlin language with new functions or steps,
provide useful commands to make repetitive or complex tasks easier to execute, or do helpful integrations with other
systems. The secret to doing so lies in the PluginAcceptor
. As mentioned earlier, the PluginAcceptor
provides
access to the host of the plugin. It provides several important methods for doing so:
-
addBinding
- These two function allow the plugin to inject whatever context it wants to the host. For example, doingaddBinding('x',1)
would place a variable ofx
with a value of 1 into the console at the time of the plugin load. -
eval
- Evaluates a script in the context of the host at the time of plugin startup. For example, doingeval("sum={x,y->x+y}")
would create asum
function that would be available to the user of the Console after the load of the plugin. -
environment
- Provides context from the host environment. For the console, the environment will return aMap
containing a reference to theIO
stream and theGroovysh
instance. These classes represent very low-level access to the underpinnings of the console. Access toGroovysh
allows for advanced features such as registering new commands (e.g. like the:plugin
or:remote
commands).
Plugins can also tie into the :remote
and :submit
commands. Recall that a :remote
represents a different
context within which Gremlin is executed, when issued with :submit
. It is encouraged to use this integration point
when possible, as opposed to registering new commands that can otherwise follow the :remote
and :submit
pattern.
To expose this integration point as part of a plugin, implement the RemoteAcceptor
interface:
Tip
|
Be good to the users of plugins and prevent dependency conflicts. Maintaining a conflict free plugin is most easily done by using the Maven Enforcer Plugin. |
Tip
|
Consider binding the plugin’s minor version to the TinkerPop minor version so that it’s easy for users to figure out plugin compatibility. Otherwise, clearly document a compatibility matrix for the plugin somewhere that users can find it. |
package org.apache.tinkerpop.gremlin.groovy.plugin;
import org.codehaus.groovy.tools.shell.Groovysh;
import java.io.Closeable;
import java.util.List;
/**
* The Gremlin Console supports the {@code :remote} and {@code :submit} commands which provide standardized ways
* for plugins to provide "remote connections" to resources and a way to "submit" a command to those resources.
* A "remote connection" does not necessarily have to be a remote server. It simply refers to a resource that is
* external to the console.
* <p/>
* By implementing this interface and returning an instance of it through {@link GremlinPlugin#remoteAcceptor()} a
* plugin can hook into those commands and provide remoting features.
*
* @author Stephen Mallette (http://stephen.genoprime.com)
*/
public interface RemoteAcceptor extends Closeable {
public static final String RESULT = "result";
/**
* Gets called when {@code :remote} is used in conjunction with the "connect" option. It is up to the
* implementation to decide how additional arguments on the line should be treated after "connect".
*
* @return an object to display as output to the user
* @throws org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException if there is a problem with connecting
*/
public Object connect(final List<String> args) throws RemoteException;
/**
* Gets called when {@code :remote} is used in conjunction with the {@code config} option. It is up to the
* implementation to decide how additional arguments on the line should be treated after {@code config}.
*
* @return an object to display as output to the user
* @throws org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException if there is a problem with configuration
*/
public Object configure(final List<String> args) throws RemoteException;
/**
* Gets called when {@code :submit} is executed. It is up to the implementation to decide how additional
* arguments on the line should be treated after {@code :submit}.
*
* @return an object to display as output to the user
* @throws org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException if there is a problem with submission
*/
public Object submit(final List<String> args) throws RemoteException;
/**
* If the {@code RemoteAcceptor} is used in the Gremlin Console, then this method might be called to determine
* if it can be used in a fashion that supports the {@code :remote console} command. By default, this value is
* set to {@code false}.
* <p/>
* A {@code RemoteAcceptor} should only return {@code true} for this method if it expects that all activities it
* supports are executed through the {@code :sumbit} command. If the users interaction with the remote requires
* working with both local and remote evaluation at the same time, it is likely best to keep this method return
* {@code false}. A good example of this type of plugin would be the Gephi Plugin which uses {@code :remote config}
* to configure a local {@code TraversalSource} to be used and expects calls to {@code :submit} for the same body
* of analysis.
*/
public default boolean allowRemoteConsole() {
return false;
}
/**
* Retrieve a script as defined in the shell context. This allows for multi-line scripts to be submitted.
*/
public static String getScript(final String submittedScript, final Groovysh shell) {
return submittedScript.startsWith("@") ? shell.getInterp().getContext().getProperty(submittedScript.substring(1)).toString() : submittedScript;
}
}
The RemoteAcceptor
implementation ties to a GremlinPlugin
and will only be executed when in use with the Gremlin
Console plugin host. Simply instantiate and return a RemoteAcceptor
in the GremlinPlugin.remoteAcceptor()
method
of the plugin implementation. Generally speaking, each call to remoteAcceptor()
should produce a new instance of
a RemoteAcceptor
. It will likely be necessary that you provide context from the GremlinPlugin
to the
RemoteAcceptor
plugin. For example, the RemoteAcceptor
implementation might require an instance of Groovysh
to provide a way to dynamically evaluate a script provided to it so that it can process the results in a different way.