Example and usage

To keep things simple, the following rules have been followed during development:

  • deel-lip follows the keras package structure.

  • All elements (layers, activations, initializers, …) are compatible with the standard keras elements.

  • When a k-Lipschitz layer overrides a standard keras layer, it uses the same interface and the same parameters. The only difference is a new parameter that controls the Lipschitz constant of the layer (see the sketch below).

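For instance, the following minimal sketch shows SpectralDense used as a drop-in replacement for Dense. The import path deel.lip.layers and the name of the extra parameter, k_coef_lip, are assumptions that may differ between deel-lip versions.

```python
# Minimal sketch: SpectralDense as a drop-in replacement for Dense.
# Assumption: the import path `deel.lip.layers` and the extra parameter
# name `k_coef_lip` may differ depending on the deel-lip version.
import tensorflow as tf
from deel.lip.layers import SpectralDense

# Standard keras layer: not 1-Lipschitz in general.
dense = tf.keras.layers.Dense(64, activation=None, use_bias=True)

# deel-lip counterpart: same interface, plus a Lipschitz-constant parameter.
lip_dense = SpectralDense(64, activation=None, use_bias=True, k_coef_lip=1.0)
```
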
Which layers are safe to use?

The following table indicates which layers are safe to use in a Lipschitz network, and which are not.

| layer | 1-Lipschitz? | deel-lip equivalent | comments |
|---|---|---|---|
| Dense | no | SpectralDense, FrobeniusDense | SpectralDense and FrobeniusDense are similar when there is a single output. |
| Conv2D | no | SpectralConv2D, FrobeniusConv2D | SpectralConv2D also implements Björck normalization. |
| MaxPooling, GlobalMaxPooling | yes | n/a | |
| AveragePooling2D, GlobalAveragePooling2D | no | ScaledAveragePooling2D, ScaledGlobalAveragePooling2D | The Lipschitz constant is bounded by sqrt(pool_h * pool_w). |
| Flatten | yes | n/a | |
| Dropout | no | None | The Lipschitz constant is bounded by the dropout factor. |
| BatchNorm | no | None | We suspect that layer normalization already limits internal covariate shift. |

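To make the table concrete, here is a hedged sketch of a small stack in which each layer that is not 1-Lipschitz is swapped for its deel-lip equivalent, while the layers marked as safe (MaxPooling, Flatten) are kept as-is. The import path and the exact constructor arguments are assumptions and may differ between deel-lip versions.

```python
# Sketch: replacing non-1-Lipschitz keras layers with the deel-lip
# equivalents listed in the table above. Import path and constructor
# arguments are assumptions that may differ between deel-lip versions.
import tensorflow as tf
from deel.lip.layers import SpectralConv2D, SpectralDense, ScaledAveragePooling2D

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # Conv2D is not 1-Lipschitz -> use SpectralConv2D instead.
    SpectralConv2D(16, (3, 3), padding="same"),
    # AveragePooling2D is not 1-Lipschitz -> use ScaledAveragePooling2D.
    ScaledAveragePooling2D(pool_size=(2, 2)),
    # MaxPooling and Flatten are safe to use directly.
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    # Dense is not 1-Lipschitz -> use SpectralDense instead.
    SpectralDense(10),
])
```
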
Design tips

Designing Lipschitz networks requires care in order to avoid the vanishing/exploding gradient problem.

Choosing pooling layers:

| layer | advantages | disadvantages |
|---|---|---|
| ScaledAveragePooling2D and MaxPooling2D | Very similar to the original keras implementations (just adds a scaling factor for average pooling). | Neither norm preserving nor gradient-norm preserving. |
| InvertibleDownSampling | Norm preserving and gradient-norm preserving. | Increases the number of channels (and the number of parameters of the next layer). |
| ScaledL2NormPooling2D (sqrt(avgpool(x**2))) | Norm preserving. | Lower numerical stability of the gradient when inputs are close to zero. |

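The sketch below instantiates the three pooling options side by side. The layer names come from the table; the import path and the pool_size argument (assumed to mirror the keras pooling layers) may differ between deel-lip versions.

```python
# Sketch of the pooling options discussed above. The import path and the
# `pool_size` argument (assumed to mirror keras pooling layers) may differ
# between deel-lip versions.
from deel.lip.layers import (
    ScaledAveragePooling2D,  # close to keras AveragePooling2D, with a rescaling factor
    InvertibleDownSampling,  # norm and gradient-norm preserving, but increases channels
    ScaledL2NormPooling2D,   # sqrt(avgpool(x**2)): norm preserving, less stable near zero
)

avg_pool = ScaledAveragePooling2D(pool_size=(2, 2))
inv_pool = InvertibleDownSampling(pool_size=(2, 2))  # a 2x2 window multiplies the channel count by 4
l2_pool = ScaledL2NormPooling2D(pool_size=(2, 2))
```
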
Choosing activations:

| layer | advantages | disadvantages |
|---|---|---|
| ReLU | | Creates a strong vanishing gradient effect. If you manage to learn with it, please call 911. |
| MaxMin (stack([ReLU(x), ReLU(-x)])) | Has properties similar to ReLU, but is norm and gradient-norm preserving. | Doubles the number of outputs. |
| GroupSort | Norm and gradient-norm preserving. Also limits the need for biases (as it is shift invariant). | More computationally expensive (when its parameter n is large). |

Please note that when training with HKR_loss or HKR_multiclass_loss, no activation is required on the last layer.
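
The sketch below illustrates both points: GroupSort on the hidden layers and no activation on the last layer, compiled with HKR_loss. The import paths (deel.lip.activations, deel.lip.losses) and the loss arguments (alpha, min_margin) are assumptions that may be named differently in other deel-lip versions.

```python
# Sketch: GroupSort activations on hidden layers and no activation on the
# last layer, trained with the HKR loss. Import paths and the loss arguments
# (alpha, min_margin) are assumptions that may differ between versions.
import tensorflow as tf
from deel.lip.layers import SpectralDense
from deel.lip.activations import GroupSort
from deel.lip.losses import HKR_loss

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    SpectralDense(64),
    GroupSort(2),      # norm and gradient-norm preserving activation
    SpectralDense(64),
    GroupSort(2),
    SpectralDense(1),  # last layer: no activation needed with the HKR loss
])

# HKR works on raw scores; labels are typically encoded as -1 / +1.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=HKR_loss(alpha=10.0, min_margin=1.0),
)
```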

How to use it?

Here is an example showing how to build and train a 1-Lipschitz network: