Example and usage

To keep things simple, the following rules have been followed during development:

  • deel-lip follows the keras package structure.

  • All elements (layers, activations, initializers, …) are compatible with the standard keras elements.

  • When a k-Lipschitz layer overrides a standard keras layer, it uses the same interface and the same parameters. The only difference is a new parameter that controls the Lipschitz constant of the layer (see the sketch below).

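For instance, the following minimal sketch shows SpectralDense used as a drop-in replacement for Dense. The import path deel.lip.layers and the name of the extra parameter, k_coef_lip, are assumptions that may differ between deel-lip versions.

```python
# Minimal sketch: SpectralDense as a drop-in replacement for Dense.
# Assumption: the import path `deel.lip.layers` and the extra parameter
# name `k_coef_lip` may differ depending on the deel-lip version.
import tensorflow as tf
from deel.lip.layers import SpectralDense

# Standard keras layer: not 1-Lipschitz in general.
dense = tf.keras.layers.Dense(64, activation=None, use_bias=True)

# deel-lip counterpart: same interface, plus a Lipschitz-constant parameter.
lip_dense = SpectralDense(64, activation=None, use_bias=True, k_coef_lip=1.0)
```
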
Which layers are safe to use?

The following table indicates which layers are safe to use in a Lipschitz network, and which are not.

| layer | 1-Lipschitz? | deel-lip equivalent | comments |
|---|---|---|---|
| Dense | no | SpectralDense, FrobeniusDense | SpectralDense and FrobeniusDense are similar when there is a single output. |
| Conv2D | no | SpectralConv2D, FrobeniusConv2D | SpectralConv2D also implements Björck normalization. |
| MaxPooling, GlobalMaxPooling | yes | n/a | |
| AveragePooling2D, GlobalAveragePooling2D | no | ScaledAveragePooling2D, ScaledGlobalAveragePooling2D | The Lipschitz constant is bounded by sqrt(pool_h * pool_w). |
| Flatten | yes | n/a | |
| Dropout | no | None | The Lipschitz constant is bounded by the dropout factor. |
| BatchNorm | no | None | We suspect that layer normalization already limits internal covariate shift. |

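To make the table concrete, here is a hedged sketch of a small stack in which each layer that is not 1-Lipschitz is swapped for its deel-lip equivalent, while the layers marked as safe (MaxPooling, Flatten) are kept as-is. The import path and the exact constructor arguments are assumptions and may differ between deel-lip versions.

```python
# Sketch: replacing non-1-Lipschitz keras layers with the deel-lip
# equivalents listed in the table above. Import path and constructor
# arguments are assumptions that may differ between deel-lip versions.
import tensorflow as tf
from deel.lip.layers import SpectralConv2D, SpectralDense, ScaledAveragePooling2D

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # Conv2D is not 1-Lipschitz -> use SpectralConv2D instead.
    SpectralConv2D(16, (3, 3), padding="same"),
    # AveragePooling2D is not 1-Lipschitz -> use ScaledAveragePooling2D.
    ScaledAveragePooling2D(pool_size=(2, 2)),
    # MaxPooling and Flatten are safe to use directly.
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    # Dense is not 1-Lipschitz -> use SpectralDense instead.
    SpectralDense(10),
])
```
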
Design tips

Designing Lipschitz networks requires care in order to avoid the vanishing/exploding gradient problem.

Choosing pooling layers:

| layer | advantages | disadvantages |
|---|---|---|
| ScaledAveragePooling2D and MaxPooling2D | Very similar to the original keras implementations (just adds a scaling factor for average pooling). | Neither norm preserving nor gradient-norm preserving. |
| InvertibleDownSampling | Norm preserving and gradient-norm preserving. | Increases the number of channels (and the number of parameters of the next layer). |
| ScaledL2NormPooling2D (sqrt(avgpool(x**2))) | Norm preserving. | Lower numerical stability of the gradient when inputs are close to zero. |

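The sketch below instantiates the three pooling options side by side. The layer names come from the table; the import path and the pool_size argument (assumed to mirror the keras pooling layers) may differ between deel-lip versions.

```python
# Sketch of the pooling options discussed above. The import path and the
# `pool_size` argument (assumed to mirror keras pooling layers) may differ
# between deel-lip versions.
from deel.lip.layers import (
    ScaledAveragePooling2D,  # close to keras AveragePooling2D, with a rescaling factor
    InvertibleDownSampling,  # norm and gradient-norm preserving, but increases channels
    ScaledL2NormPooling2D,   # sqrt(avgpool(x**2)): norm preserving, less stable near zero
)

avg_pool = ScaledAveragePooling2D(pool_size=(2, 2))
inv_pool = InvertibleDownSampling(pool_size=(2, 2))  # a 2x2 window multiplies the channel count by 4
l2_pool = ScaledL2NormPooling2D(pool_size=(2, 2))
```
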
Choosing activations:

| layer | advantages | disadvantages |
|---|---|---|
| ReLU | | Creates a strong vanishing gradient effect. If you manage to learn with it, please call 911. |
| MaxMin (stack([ReLU(x), ReLU(-x)])) | Has properties similar to ReLU, but is norm and gradient-norm preserving. | Doubles the number of outputs. |
| GroupSort | Norm and gradient-norm preserving. Also limits the need for biases (as it is shift invariant). | More computationally expensive (when its parameter n is large). |

Please note that when training with HKR_loss or HKR_multiclass_loss, no activation is required on the last layer.
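
The sketch below illustrates both points: GroupSort on the hidden layers and no activation on the last layer, compiled with HKR_loss. The import paths (deel.lip.activations, deel.lip.losses) and the loss arguments (alpha, min_margin) are assumptions that may be named differently in other deel-lip versions.

```python
# Sketch: GroupSort activations on hidden layers and no activation on the
# last layer, trained with the HKR loss. Import paths and the loss arguments
# (alpha, min_margin) are assumptions that may differ between versions.
import tensorflow as tf
from deel.lip.layers import SpectralDense
from deel.lip.activations import GroupSort
from deel.lip.losses import HKR_loss

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    SpectralDense(64),
    GroupSort(2),      # norm and gradient-norm preserving activation
    SpectralDense(64),
    GroupSort(2),
    SpectralDense(1),  # last layer: no activation needed with the HKR loss
])

# HKR works on raw scores; labels are typically encoded as -1 / +1.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=HKR_loss(alpha=10.0, min_margin=1.0),
)
```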

How to use it?

Here is an example showing how to build and train a 1-Lipschitz network: