Interface MapReduce<MK,MV,RK,RV,R>
-
- All Superinterfaces:
Cloneable
- All Known Implementing Classes:
ClusterCountMapReduce
,ClusterPopulationMapReduce
,GraphComputerTest.MapReduceB
,GraphComputerTest.MapReduceBB
,GraphComputerTest.MapReduceC
,GraphComputerTest.MapReduceK
,PageRankMapReduce
,StaticMapReduce
public interface MapReduce<MK,MV,RK,RV,R> extends Cloneable
A MapReduce is composed of map(), combine(), and reduce() stages. TheMapReduce.Stage.MAP
stage processes the vertices of theGraph
in a logically parallel manner. TheMapReduce.Stage.COMBINE
stage aggregates the values of a particular map emitted key prior to sending across the cluster. TheMapReduce.Stage.REDUCE
stage aggregates the values of the combine/map emitted keys for the keys that hash to the current machine in the cluster. The interface presented here is nearly identical to the interface popularized by Hadoop save the map() is over the vertices of the graph.- Author:
- Marko A. Rodriguez (http://markorodriguez.com)
-
-
Nested Class Summary
Nested Classes Modifier and Type Interface Description static interface
MapReduce.MapEmitter<K,V>
The MapEmitter is used to emit key/value pairs from the map() stage of the MapReduce job.static class
MapReduce.NullObject
A convenience singleton when a single key is needed so that all emitted values converge to the same combiner/reducer.static interface
MapReduce.ReduceEmitter<OK,OV>
The ReduceEmitter is used to emit key/value pairs from the combine() and reduce() stages of the MapReduce job.static class
MapReduce.Stage
MapReduce is composed of three stages: map, combine, and reduce.
-
Field Summary
Fields Modifier and Type Field Description static String
MAP_REDUCE
-
Method Summary
All Methods Static Methods Instance Methods Abstract Methods Default Methods Modifier and Type Method Description default void
addResultToMemory(Memory.Admin memory, Iterator<KeyValue<RK,RV>> keyValues)
The final result can be generated and added toMemory
and accessible viaDefaultComputerResult
.MapReduce<MK,MV,RK,RV,R>
clone()
When multiple workers on a single machine need MapReduce instances, it is possible to use clone.default void
combine(MK key, Iterator<MV> values, MapReduce.ReduceEmitter<RK,RV> emitter)
The combine() method is logically executed at all "machines" in parallel.static <M extends MapReduce>
McreateMapReduce(Graph graph, org.apache.commons.configuration2.Configuration configuration)
A helper method to construct aMapReduce
given the content of the supplied configuration.boolean
doStage(MapReduce.Stage stage)
A MapReduce job can be map-only, map-reduce-only, or map-combine-reduce.R
generateFinalResult(Iterator<KeyValue<RK,RV>> keyValues)
The key/value pairs emitted by reduce() (or map() in a map-only job) can be iterated to generate a local JVM Java object.default Optional<Comparator<MK>>
getMapKeySort()
If aComparator
is provided, then all pairs leaving theMapReduce.MapEmitter
are sorted.String
getMemoryKey()
The results of the MapReduce job are associated with a memory-key to ultimately be stored inMemory
.default Optional<Comparator<RK>>
getReduceKeySort()
If aComparator
is provided, then all pairs leaving theMapReduce.ReduceEmitter
are sorted.default void
loadState(Graph graph, org.apache.commons.configuration2.Configuration configuration)
When it is necessary to load the state of a MapReduce job, this method is called.void
map(Vertex vertex, MapReduce.MapEmitter<MK,MV> emitter)
The map() method is logically executed at all vertices in the graph in parallel.default void
reduce(MK key, Iterator<MV> values, MapReduce.ReduceEmitter<RK,RV> emitter)
The reduce() method is logically on the "machine" the respective key hashes to.default void
storeState(org.apache.commons.configuration2.Configuration configuration)
When it is necessary to store the state of a MapReduce job, this method is called.default void
workerEnd(MapReduce.Stage stage)
This method is called at the end of the respectiveMapReduce.Stage
for a particular "chunk of vertices." The set of vertices in the graph are typically not processed with full parallelism.default void
workerStart(MapReduce.Stage stage)
This method is called at the start of the respectiveMapReduce.Stage
for a particular "chunk of vertices." The set of vertices in the graph are typically not processed with full parallelism.
-
-
-
Field Detail
-
MAP_REDUCE
static final String MAP_REDUCE
- See Also:
- Constant Field Values
-
-
Method Detail
-
storeState
default void storeState(org.apache.commons.configuration2.Configuration configuration)
When it is necessary to store the state of a MapReduce job, this method is called. This is typically required when the MapReduce job needs to be serialized to another machine. Note that what is stored is simply the instance state, not any processed data.- Parameters:
configuration
- the configuration to store the state of the MapReduce job in.
-
loadState
default void loadState(Graph graph, org.apache.commons.configuration2.Configuration configuration)
When it is necessary to load the state of a MapReduce job, this method is called. This is typically required when the MapReduce job needs to be serialized to another machine. Note that what is loaded is simply the instance state, not any processed data. It is important that the state loaded from loadState() is identical to any state created from a constructor. For those GraphComputers that do not need to use Configurations to migrate state between JVMs, the constructor will only be used.- Parameters:
graph
- the graph the MapReduce job will run againstconfiguration
- the configuration to load the state of the MapReduce job from.
-
doStage
boolean doStage(MapReduce.Stage stage)
A MapReduce job can be map-only, map-reduce-only, or map-combine-reduce. Before executing the particular stage, this method is called to determine if the respective stage is defined. This method should return true if the respective stage as a non-default method implementation.- Parameters:
stage
- the stage to check for definition.- Returns:
- whether that stage should be executed.
-
map
void map(Vertex vertex, MapReduce.MapEmitter<MK,MV> emitter)
The map() method is logically executed at all vertices in the graph in parallel. The map() method emits key/value pairs given some analysis of the data in the vertices (and/or its incident edges). AllMapReduce
classes must at least provide an implementation ofMapReduce#map(Vertex, MapEmitter)
.- Parameters:
vertex
- the current vertex being map() processed.emitter
- the component that allows for key/value pairs to be emitted to the next stage.
-
combine
default void combine(MK key, Iterator<MV> values, MapReduce.ReduceEmitter<RK,RV> emitter)
The combine() method is logically executed at all "machines" in parallel. The combine() method pre-combines the values for a key prior to propagation over the wire. The combine() method must emit the same key/value pairs as the reduce() method. If there is a combine() implementation, there must be a reduce() implementation. If the MapReduce implementation is single machine, it can skip executing this method as reduce() is sufficient.- Parameters:
key
- the key that has aggregated valuesvalues
- the aggregated values associated with the keyemitter
- the component that allows for key/value pairs to be emitted to the reduce stage.
-
reduce
default void reduce(MK key, Iterator<MV> values, MapReduce.ReduceEmitter<RK,RV> emitter)
The reduce() method is logically on the "machine" the respective key hashes to. The reduce() method combines all the values associated with the key and emits key/value pairs.- Parameters:
key
- the key that has aggregated valuesvalues
- the aggregated values associated with the keyemitter
- the component that allows for key/value pairs to be emitted as the final result.
-
workerStart
default void workerStart(MapReduce.Stage stage)
This method is called at the start of the respectiveMapReduce.Stage
for a particular "chunk of vertices." The set of vertices in the graph are typically not processed with full parallelism. The vertex set is split into subsets and a worker is assigned to call the MapReduce methods on it method. The default implementation is a no-op.- Parameters:
stage
- the stage of the MapReduce computation
-
workerEnd
default void workerEnd(MapReduce.Stage stage)
This method is called at the end of the respectiveMapReduce.Stage
for a particular "chunk of vertices." The set of vertices in the graph are typically not processed with full parallelism. The vertex set is split into subsets and a worker is assigned to call the MapReduce methods on it method. The default implementation is a no-op.- Parameters:
stage
- the stage of the MapReduce computation
-
getMapKeySort
default Optional<Comparator<MK>> getMapKeySort()
If aComparator
is provided, then all pairs leaving theMapReduce.MapEmitter
are sorted. The sorted results are either fed sorted to the combine/reduce-stage or as the final output. If sorting is not required, thenOptional.empty()
should be returned as sorting is computationally expensive. The default implementation returnsOptional.empty()
.- Returns:
- an
Optional
of a comparator for sorting the map output.
-
getReduceKeySort
default Optional<Comparator<RK>> getReduceKeySort()
If aComparator
is provided, then all pairs leaving theMapReduce.ReduceEmitter
are sorted. If sorting is not required, thenOptional.empty()
should be returned as sorting is computationally expensive. The default implementation returnsOptional.empty()
.- Returns:
- an
Optional
of a comparator for sorting the reduce output.
-
generateFinalResult
R generateFinalResult(Iterator<KeyValue<RK,RV>> keyValues)
The key/value pairs emitted by reduce() (or map() in a map-only job) can be iterated to generate a local JVM Java object.- Parameters:
keyValues
- the key/value pairs that were emitted from reduce() (or map() in a map-only job)- Returns:
- the resultant object formed from the emitted key/values.
-
getMemoryKey
String getMemoryKey()
The results of the MapReduce job are associated with a memory-key to ultimately be stored inMemory
.- Returns:
- the memory key of the generated result object.
-
addResultToMemory
default void addResultToMemory(Memory.Admin memory, Iterator<KeyValue<RK,RV>> keyValues)
The final result can be generated and added toMemory
and accessible viaDefaultComputerResult
. The default simply takes the object from generateFinalResult() and adds it to the Memory given getMemoryKey().- Parameters:
memory
- the memory of theGraphComputer
keyValues
- the key/value pairs emitted from reduce() (or map() in a map only job).
-
clone
MapReduce<MK,MV,RK,RV,R> clone()
When multiple workers on a single machine need MapReduce instances, it is possible to use clone. This will provide a speedier way of generating instances, over thestoreState(org.apache.commons.configuration2.Configuration)
andloadState(org.apache.tinkerpop.gremlin.structure.Graph, org.apache.commons.configuration2.Configuration)
model. The default implementation simply returns the object as it assumes that the MapReduce instance is a stateless singleton.- Returns:
- A clone of the MapReduce object
-
createMapReduce
static <M extends MapReduce> M createMapReduce(Graph graph, org.apache.commons.configuration2.Configuration configuration)
A helper method to construct aMapReduce
given the content of the supplied configuration. The class of the MapReduce is read from theMAP_REDUCE
static configuration key. Once the MapReduce is constructed,loadState(org.apache.tinkerpop.gremlin.structure.Graph, org.apache.commons.configuration2.Configuration)
method is called with the provided configuration.- Parameters:
graph
- The graph that the MapReduce job will run againstconfiguration
- A configuration with requisite information to build a MapReduce- Returns:
- the newly constructed MapReduce
-
-