Example and usage¶
In order to keep things simple, the following rules have been followed during development:

- deel-lip follows the keras package structure.
- All elements (layers, activations, initializers, …) are compatible with the standard keras elements.
- When a k-Lipschitz layer overrides a standard keras layer, it uses the same interface and the same parameters. The only difference is a new parameter to control the Lipschitz constant of the layer, as sketched below.
Which layers are safe to use?¶
The following table indicates which layers are safe to use in a Lipschitz network, and which are not.
| layer | 1-lip? | deel-lip equivalent | comments |
|---|---|---|---|
| Dense | no | SpectralDense, FrobeniusDense | |
| Conv2D | no | SpectralConv2D, FrobeniusConv2D | |
| MaxPooling, GlobalMaxPooling | yes | n/a | |
| AveragePooling2D, GlobalAveragePooling2D | no | ScaledAveragePooling2D, ScaledGlobalAveragePooling2D | The Lipschitz constant is bounded by … |
| Flatten | yes | n/a | |
| Dropout | no | None | The Lipschitz constant is bounded by the dropout factor. |
| BatchNormalization | no | None | We suspect that layer normalization already limits internal covariate shift. |
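As a minimal illustration of these substitutions (assuming the SpectralConv2D and SpectralDense classes listed above, and relying on the rule that deel-lip layers reuse the keras constructor arguments), a small convolutional stack could be made 1-Lipschitz like this:

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Flatten, MaxPooling2D
from deel.lip.layers import SpectralConv2D, SpectralDense  # assumed import path

# Conv2D -> SpectralConv2D and Dense -> SpectralDense; MaxPooling and Flatten
# are already 1-Lipschitz, so the standard keras layers can be kept.
inputs = Input(shape=(28, 28, 1))
x = SpectralConv2D(16, (3, 3))(inputs)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
outputs = SpectralDense(10)(x)
model = Model(inputs, outputs)
```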
Design tips¶
Designing Lipschitz networks requires care in order to avoid the vanishing/exploding gradient problem.
Choosing pooling layers:
| layer | advantages | disadvantages |
|---|---|---|
| MaxPooling2D, ScaledAveragePooling2D | very similar to the original implementation (just adds a scaling factor for the average). | not norm preserving nor gradient norm preserving. |
| InvertibleDownSampling | norm preserving and gradient norm preserving. | increases the number of channels (and the number of parameters of the next layer). |
| ScaledL2NormPooling2D | norm preserving. | lower numerical stability of the gradient when inputs are close to zero. |
Choosing activations:
| layer | advantages | disadvantages |
|---|---|---|
| ReLU | | creates a strong vanishing gradient effect. If you manage to learn with it, please call 911. |
| MaxMin | has similar properties to ReLU, but is norm and gradient norm preserving. | doubles the number of outputs. |
| GroupSort | input and gradient norm preserving. Also limits the need for biases (as it is shift invariant). | more computationally expensive (when its parameter n is large). |
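To make the shape remarks above concrete, here is a minimal sketch assuming the MaxMin and GroupSort activations are exposed in deel.lip.activations with these signatures:

```python
import tensorflow as tf
from deel.lip.activations import GroupSort, MaxMin  # assumed import path

x = tf.random.normal((1, 8))    # a batch of one vector with 8 features

# MaxMin doubles the number of outputs (see the table above): 8 -> 16.
print(MaxMin()(x).shape)        # expected: (1, 16)

# GroupSort sorts values within groups of size n and keeps the shape: 8 -> 8.
print(GroupSort(n=2)(x).shape)  # expected: (1, 8)
```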
Please note that when learning with the HKR_loss and HKR_multiclass_loss, no activation is required on the last layer.
How to use it?¶
Here is an example showing how to build and train a 1-Lipschitz network:
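The sketch below is one way such a model could look. It assumes that SpectralDense, FrobeniusDense, GroupSort2, a Lipschitz-aware Sequential model and HKR_loss are available under deel.lip as imported below, that HKR_loss takes an alpha argument, and that it expects labels in {-1, 1}; it is an illustration of the guidelines above rather than a reference implementation.

```python
import numpy as np
from tensorflow.keras.layers import Input, Flatten
from deel.lip.layers import SpectralDense, FrobeniusDense  # assumed import paths
from deel.lip.activations import GroupSort2
from deel.lip.model import Sequential  # Lipschitz-aware drop-in for keras Sequential
from deel.lip.losses import HKR_loss

# Toy binary classification data; HKR_loss is assumed to expect labels in {-1, 1}.
x_train = np.random.normal(size=(256, 28, 28)).astype("float32")
y_train = np.where(np.random.normal(size=(256, 1)) > 0, 1.0, -1.0).astype("float32")

model = Sequential(
    [
        Input(shape=(28, 28)),
        Flatten(),
        SpectralDense(64, activation=GroupSort2()),
        SpectralDense(32, activation=GroupSort2()),
        # No activation on the last layer when training with HKR_loss.
        FrobeniusDense(1, activation=None),
    ],
    k_coef_lip=1.0,  # Lipschitz constant of the whole network
)

model.compile(
    optimizer="adam",
    loss=HKR_loss(alpha=10.0),  # the alpha argument is an assumption
)

model.fit(x_train, y_train, batch_size=64, epochs=2)
```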