
Batch Learning on task graph


Implementing a Spark-like batch learning system on top of the task graph would be a useful exercise for us: first, it tests our implementation; second, it gives us a user-level package that we can talk about to drum up interest.

There are two kinds of tasks we need to implement: the parameter server and the data slaves (each holding a segment of the data). The parameter server is responsible for holding the parameters and carrying out the optimization. Note that the optimization is an iterative process over the parameters; say we use gradient descent. At each epoch, the current parameters (a shared variable in Spark terms) are passed to each data slave through the task topology. Each data slave computes the gradient of the loss function on its data shard at the current parameters, and reduces the gradients (an accumulator, which can be implemented easily via buffered channels) from its children back up to its parent.
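To make the reduce step concrete, here is a minimal Go sketch; reduceGradients and the channel setup are hypothetical illustrations, not part of the taskgraph API:

```go
package main

import "fmt"

// reduceGradients adds a task's own local gradient to the gradients received
// from its numChildren children over a channel, then returns the sum to be
// passed up to its parent.
func reduceGradients(local []float64, from <-chan []float64, numChildren int) []float64 {
	sum := make([]float64, len(local))
	copy(sum, local)
	for i := 0; i < numChildren; i++ {
		g := <-from // block until one child's gradient arrives
		for j, v := range g {
			sum[j] += v
		}
	}
	return sum
}

func main() {
	// The buffered channel acts as the accumulator: children can send without blocking.
	ch := make(chan []float64, 2)
	ch <- []float64{0.1, -0.2}
	ch <- []float64{0.3, 0.4}
	fmt.Println(reduceGradients([]float64{1.0, 1.0}, ch, 2)) // approximately [1.4 1.2]
}
```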

The data slaves can provide RDD support naturally. The application defines how its data is processed as a pipeline from the source. So every time a node recovers from failure, it will naturally rerun the code (in task init) that brings the data in from the source and processes it into some in-memory data structure, before it starts listening for parent meta ready.
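For instance, a rough sketch of such a task init with hypothetical names (the real taskgraph task interface may differ):

```go
package slave

import (
	"bufio"
	"os"
	"strings"
)

// dataShardTask is a hypothetical data slave. Because the pipeline lives in
// the init path, a node recovering from failure naturally re-runs it before
// it starts listening for "parent meta ready".
type dataShardTask struct {
	shard []string // in-memory data structure rebuilt on every (re)start
}

// initShard re-reads the shard from its source and processes it in memory.
func (t *dataShardTask) initShard(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	t.shard = t.shard[:0]
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Application-defined processing step; here we just normalize case.
		t.shard = append(t.shard, strings.ToLower(s.Text()))
	}
	// Only after this returns would the task wait on parent meta ready.
	return s.Err()
}
```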

See also an old page: https://github.com/go-distributed/meritop/wiki/implement-Spark-Contract-on-taskgraph

Question on RDD design in Go (@Xiang, @Hongchao). The central idea of RDD is drawing Datum from a source, transforming it into the desired form, and maybe storing it in a randomly accessible form. In the old days, when we had templates, we could simply chain transformers together:

```java
class Transformer<OldType, NewType> {
    NewType transform(OldType datum);
}
```

drawing from an iterator:

```java
class Iterator<OldType> {
    boolean hasNext();
    OldType next();
}
```

And the chain could potentially end up in a linearly accessible cache.

One question I don't have a good answer to: without templates, what is the correct way to do this? I guess we can simply make Datum an empty interface and type-assert it back to a concrete type when we need it. So it becomes:

A DatumIteratorBuilder takes a path and builds a DatumIterator:

```go
// Datum is an empty interface; concrete types are recovered via type assertion.
type Datum interface{}

// DatumIterator draws Datum one at a time from some source.
type DatumIterator interface {
	HasNext() bool
	Next() Datum
}

// DatumIteratorBuilder takes a path and builds a DatumIterator.
type DatumIteratorBuilder interface {
	Build(path string) DatumIterator
}

// DatumTransformer transforms one Datum into another.
type DatumTransformer interface {
	Transform(old Datum) Datum
}

// DatumStore holds transformed Datum in a linearly accessible cache.
type DatumStore struct {
	Cache []Datum
}
```
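For example, a rough usage sketch; the concrete types below are made up for illustration and assume Datum is an empty interface as above:

```go
// Hypothetical concrete types, just to show how the pieces chain together;
// none of them are part of the proposal.
type sliceIterator struct {
	data []Datum
	pos  int
}

func (it *sliceIterator) HasNext() bool { return it.pos < len(it.data) }

func (it *sliceIterator) Next() Datum {
	d := it.data[it.pos]
	it.pos++
	return d
}

type doubleTransformer struct{}

// Transform type-asserts the Datum back to a concrete type and doubles it.
func (doubleTransformer) Transform(old Datum) Datum { return old.(int) * 2 }

// materialize drains an iterator through a transformer into a DatumStore.
func materialize(it DatumIterator, tr DatumTransformer) DatumStore {
	var store DatumStore
	for it.HasNext() {
		store.Cache = append(store.Cache, tr.Transform(it.Next()))
	}
	return store
}
```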

What do you guys think? I think we need to put this in a separate directory, but note that it could be useful not just for the Spark-like package but for other projects as well.
