PowerIterationClustering

class pyspark.ml.clustering.PowerIterationClustering(*, k=2, maxIter=20, initMode='random', srcCol='src', dstCol='dst', weightCol=None)

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

This class is not yet an Estimator/Transformer; use the assignClusters() method to run the PowerIterationClustering algorithm.

New in version 2.4.0.

Notes

See Wikipedia on Spectral clustering
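
As a rough illustration of the idea quoted from the abstract, the following pure-Python sketch runs truncated power iteration on a row-normalized affinity matrix. It is not Spark's implementation: the function name `pic_embedding`, the dense list-of-lists input, the degree-based start vector, and the fixed iteration count are all assumptions made for this example. In PIC, the resulting one-dimensional embedding would then be clustered, typically with k-means.

```python
def pic_embedding(affinity, max_iter=20):
    """Truncated power iteration on a row-normalized affinity matrix.

    `affinity` is assumed to be a symmetric list of lists with nonnegative
    entries, a zero diagonal, and a positive sum in every row.
    """
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalize: W = D^-1 A, so every row of W sums to 1.
    w = [[a / degrees[i] for a in affinity[i]] for i in range(n)]

    # Degree-based start vector: each vertex's share of the total weight.
    total = sum(degrees)
    v = [d / total for d in degrees]

    for _ in range(max_iter):
        nxt = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in nxt)  # L1 normalization
        v = [x / norm for x in nxt]
    return v


# Hypothetical toy graph: two triangles joined by one weak edge (2, 3).
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 0.5), (3, 5, 0.5), (4, 5, 0.5),
         (2, 3, 0.01)]
affinity = [[0.0] * 6 for _ in range(6)]
for i, j, s in edges:
    affinity[i][j] = affinity[j][i] = s

embedding = pic_embedding(affinity, max_iter=10)
```

In this toy graph, vertices within the same triangle end up with nearly equal embedding values while the two triangles stay clearly apart, which is what makes a simple clustering step on the embedding workable.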

Examples

>>> data = [(1, 0, 0.5),
...         (2, 0, 0.5), (2, 1, 0.7),
...         (3, 0, 0.5), (3, 1, 0.7), (3, 2, 0.9),
...         (4, 0, 0.5), (4, 1, 0.7), (4, 2, 0.9), (4, 3, 1.1),
...         (5, 0, 0.5), (5, 1, 0.7), (5, 2, 0.9), (5, 3, 1.1), (5, 4, 1.3)]
>>> df = spark.createDataFrame(data).toDF("src", "dst", "weight").repartition(1)
>>> pic = PowerIterationClustering(k=2, weightCol="weight")
>>> pic.setMaxIter(40)
PowerIterationClustering...
>>> assignments = pic.assignClusters(df)
>>> assignments.sort(assignments.id).show(truncate=False)
+---+-------+
|id |cluster|
+---+-------+
|0  |0      |
|1  |0      |
|2  |0      |
|3  |0      |
|4  |0      |
|5  |1      |
+---+-------+
...
>>> pic_path = temp_path + "/pic"
>>> pic.save(pic_path)
>>> pic2 = PowerIterationClustering.load(pic_path)
>>> pic2.getK()
2
>>> pic2.getMaxIter()
40
>>> pic2.assignClusters(df).take(6) == assignments.take(6)
True

Methods

`assignClusters`(dataset)	Runs the PIC algorithm and returns a cluster assignment for each input vertex.
`clear`(param)	Clears a param from the param map if it has been explicitly set.
`copy`([extra])	Creates a copy of this instance with the same uid and some extra params.
`explainParam`(param)	Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
`explainParams`()	Returns the documentation of all params with their optional default values and user-supplied values.
`extractParamMap`([extra])	Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
`getDstCol`()	Gets the value of `dstCol` or its default value.
`getInitMode`()	Gets the value of `initMode` or its default value.
`getK`()	Gets the value of `k` or its default value.
`getMaxIter`()	Gets the value of `maxIter` or its default value.
`getOrDefault`(param)	Gets the value of a param in the user-supplied param map or its default value.
`getParam`(paramName)	Gets a param by its name.
`getSrcCol`()	Gets the value of `srcCol` or its default value.
`getWeightCol`()	Gets the value of `weightCol` or its default value.
`hasDefault`(param)	Checks whether a param has a default value.
`hasParam`(paramName)	Tests whether this instance contains a param with a given (string) name.
`isDefined`(param)	Checks whether a param is explicitly set by user or has a default value.
`isSet`(param)	Checks whether a param is explicitly set by user.
`load`(path)	Reads an ML instance from the input path, a shortcut of read().load(path).
`read`()	Returns an MLReader instance for this class.
`save`(path)	Saves this ML instance to the given path, a shortcut of write().save(path).
`set`(param, value)	Sets a parameter in the embedded param map.
`setDstCol`(value)	Sets the value of `dstCol`.
`setInitMode`(value)	Sets the value of `initMode`.
`setK`(value)	Sets the value of `k`.
`setMaxIter`(value)	Sets the value of `maxIter`.
`setParams`(self, \*[, k, maxIter, initMode, …])	Sets params for PowerIterationClustering.
`setSrcCol`(value)	Sets the value of `srcCol`.
`setWeightCol`(value)	Sets the value of `weightCol`.
`write`()	Returns an MLWriter instance for this ML instance.

Attributes

`dstCol`
`initMode`
`k`
`maxIter`
`params`	Returns all params ordered by name.
`srcCol`
`weightCol`

Methods Documentation

assignClusters(dataset)

Runs the PIC algorithm and returns a cluster assignment for each input vertex.

Parameters

dataset (pyspark.sql.DataFrame): A dataset with columns src, dst, weight representing the affinity matrix, which is the matrix A in the PIC paper. Suppose the src column value is i, the dst column value is j, and the weight column value is the similarity s_ij, which must be nonnegative. This is a symmetric matrix and hence s_ij = s_ji. For any (i, j) with nonzero similarity, there should be either (i, j, s_ij) or (j, i, s_ji) in the input. Rows with i = j are ignored, because we assume s_ij = 0.0 in that case.

Returns

pyspark.sql.DataFrame: A dataset that contains columns of vertex id and the corresponding cluster for the id. Its schema is:
- id: Long
- cluster: Int

New in version 2.4.0.
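
The input contract described above (nonnegative weights, one row per unordered vertex pair, rows with i = j ignored) can be checked before calling assignClusters(). The helper below is a hypothetical pre-processing sketch in plain Python, not part of PySpark; the name `normalize_edges` and its error messages are assumptions for illustration. It treats conflicting weights for (i, j) and (j, i) as an error, since the affinity matrix is meant to be symmetric.

```python
def normalize_edges(edges):
    """Return one (src, dst, weight) tuple per unordered vertex pair.

    Hypothetical helper: validates weights, drops self-loops, and collapses
    (i, j) and (j, i) duplicates into a single row with i < j.
    """
    seen = {}
    for src, dst, weight in edges:
        if weight < 0:
            raise ValueError(f"similarity must be nonnegative, got {weight}")
        if src == dst:
            continue  # rows with i = j are ignored by the algorithm anyway
        key = (min(src, dst), max(src, dst))
        if key in seen and seen[key] != weight:
            raise ValueError(f"asymmetric similarity for pair {key}")
        seen[key] = weight
    return [(i, j, w) for (i, j), w in sorted(seen.items())]


rows = normalize_edges([(1, 0, 0.5), (0, 1, 0.5), (2, 2, 9.0), (2, 1, 0.7)])
```

The resulting tuples could then be passed to spark.createDataFrame(...) with the src, dst, and weight column names, as in the Examples section.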

clear(param): Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra (dict, optional): Extra parameters to copy to the new instance

Returns

JavaParams: Copy of this instance

explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra (dict, optional): extra param values

Returns

dict: merged param map
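
The precedence described for extractParamMap (default param values < user-supplied values < extra) can be pictured as successive dictionary updates. This is a simplified sketch only: `merge_param_maps` is a hypothetical name, and plain string keys stand in for the Param objects that the real param map uses.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicting keys."""
    merged = dict(defaults)          # lowest precedence
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest precedence
    return merged


merged = merge_param_maps(
    defaults={"k": 2, "maxIter": 20, "initMode": "random"},
    user_supplied={"maxIter": 40},
    extra={"k": 3, "maxIter": 10},
)
```

Here `maxIter` takes the value from `extra` even though the user also set it, and `initMode` falls back to its default because neither of the other maps mentions it.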

getDstCol(): Gets the value of dstCol or its default value.

New in version 2.4.0.

getInitMode(): Gets the value of initMode or its default value.

New in version 2.4.0.

getK(): Gets the value of k or its default value.

New in version 2.4.0.

getMaxIter(): Gets the value of maxIter or its default value.

getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName): Gets a param by its name.

getSrcCol(): Gets the value of srcCol or its default value.

New in version 2.4.0.

getWeightCol(): Gets the value of weightCol or its default value.

hasDefault(param): Checks whether a param has a default value.

hasParam(paramName): Tests whether this instance contains a param with a given (string) name.

isDefined(param): Checks whether a param is explicitly set by user or has a default value.

isSet(param): Checks whether a param is explicitly set by user.

classmethod load(path): Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read(): Returns an MLReader instance for this class.

save(path): Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value): Sets a parameter in the embedded param map.

setDstCol(value): Sets the value of dstCol.

New in version 2.4.0.

setInitMode(value): Sets the value of initMode.

New in version 2.4.0.

setK(value): Sets the value of k.

New in version 2.4.0.

setMaxIter(value): Sets the value of maxIter.

New in version 2.4.0.

setParams(self, \*, k=2, maxIter=20, initMode="random", srcCol="src", dstCol="dst", weightCol=None): Sets params for PowerIterationClustering.

New in version 2.4.0.

setSrcCol(value): Sets the value of srcCol.

New in version 2.4.0.

setWeightCol(value): Sets the value of weightCol.

New in version 2.4.0.

write(): Returns an MLWriter instance for this ML instance.

Attributes Documentation

dstCol = Param(parent='undefined', name='dstCol', doc='Name of the input column for destination vertex IDs.')

initMode = Param(parent='undefined', name='initMode', doc="The initialization algorithm. This can be either 'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum of similarities with other vertices. Supported options: 'random' and 'degree'.")

k = Param(parent='undefined', name='k', doc='The number of clusters to create. Must be > 1.')

maxIter = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')

params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

srcCol = Param(parent='undefined', name='srcCol', doc='Name of the input column for source vertex IDs.')

weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')
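
The two `initMode` options documented above differ only in how the starting vector for the power iteration is built. The sketch below illustrates that difference in plain Python under stated assumptions: `init_vector` is a hypothetical function, the dense list-of-lists `affinity` input is chosen for brevity, and the uniform random draw and L1 normalization are illustrative choices rather than a description of Spark's internals.

```python
import random


def init_vector(affinity, init_mode="random", seed=None):
    """Build a start vector for power iteration, normalized to sum to 1."""
    n = len(affinity)
    if init_mode == "random":
        rng = random.Random(seed)
        raw = [rng.random() for _ in range(n)]   # one random value per vertex
    elif init_mode == "degree":
        raw = [sum(row) for row in affinity]     # sum of similarities per vertex
    else:
        raise ValueError(f"unsupported initMode: {init_mode!r}")
    total = sum(abs(x) for x in raw)
    return [x / total for x in raw]


# Hypothetical star graph: vertex 0 is connected to vertices 1 and 2.
star = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0]]
degree_start = init_vector(star, init_mode="degree")
random_start = init_vector(star, init_mode="random", seed=0)
```

With 'degree', well-connected vertices start with larger values; with 'random', the start vector carries no information about the graph, and passing a seed only makes this sketch reproducible.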
