diff --git a/README.md b/README.md
index f0c45fb1d0d24badc7ff359c2c41ee1e2b06b5e3..3b251ad5d80425125f4458c5dd18e847e3ae30c2 100644
--- a/README.md
+++ b/README.md
@@ -29,8 +29,10 @@ In short:
 
 ## Experiments
 
-Our goal is to use TVM compile a quantized ResNet50 down to a exectuable function;
-for brevity, it's not yet necessary to test the accuracy of the compiled model.
+### Basic Quantization
+
+Our goal is to use TVM to compile a quantized ResNet50 down to an executable function;
+for simplicity, it's not yet necessary to test the accuracy of the compiled model.
 
 - No usage of dataset is necessary at any point in these experiments;
   when TVM calls for inputs to the model for compilation / benchmarking purposes,
@@ -43,31 +45,37 @@ and the [TVM discussion board](https://discuss.tvm.apache.org) may have more adv
 
 There are roughly the following few major steps:
 
-1. Get an instance of a ResNet50 model implemented in PyTorch.
-   It's available in the `torchvision` package and as easy to get as a function call (remember to install the package first).
+1. Get an instance of a ResNet50 model implemented in PyTorch. It's available in the `torchvision` package.
 
-2. It may be a good idea to try using TVM on plain (un-quantized) DNN first.
-   Give TVM the network and a sample input the network takes, to compile the network into a function object that can be called from Python side and gives outputs.
+2. It's a good idea to try TVM on an un-quantized DNN first.
+   Give TVM the network and a sample input to the network,
+   and compile the network into a function object that can be called from the Python side to produce DNN outputs (see the first sketch after step 5).
 
-   The TVM how-to guides has complete tutorials on how to do this step.
-   Pay attention to *which hardware (CPU? GPU?) the model is being compiled for* and how to specify it.
+   The TVM how-to guides have complete tutorials on how to do this step.
+   Pay attention to the compilation **target**:
+   which hardware (CPU? GPU?) the model is being compiled for, and understand how to specify it.
+   Compile for the GPU if you have one, or the CPU otherwise.
 
 3. Now, quantize the model down to **int8** precision.
    TVM itself has utilities to quantize a DNN before compilation;
-   you can find how-to in the guides and forum.
+   you can find how-tos in the guides and forum.
+   Again, you should get a function object that can be called from the Python side.
+
+   **Hint**: there is a namespace `tvm.relay.quantize` and everything you need is somewhere in there (see the second sketch after step 5).
 
-   Do this for the GPU (if you have one), or CPU otherwise.
-   Use TVM utils to benchmark the inference time of the quantized model vs. the un-quantized model.
+4. Just for your own check -- how can you see the TVM code in the compiled module?
+   Did the quantization actually happen -- for example, did the datatypes in the code change? (The third sketch after step 5 shows one way to check.)
 
-   We're not (yet) looking to maximize the performance of the DNN with quantization,
-   but if there is no speedup, you should look into it and form your own guess.
+5. Use TVM's utility functions to benchmark the inference time of the quantized model vs. the un-quantized model.
 
-   - Hint: TVM may print the following only for the quantized case, or for both -- what does it mean?
-     > One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
+   In this task we will not try to maximize the performance of the quantized DNN,
+   but if there is no speedup, you should try to understand it and formulate a guess.
+
+   **Hint**: TVM may print the following when you compile the DNN -- what does it mean?
+   > One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
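+
+   The three hedged sketches below make steps 1-5 concrete. First, steps 1-2: this
+   assumes a recent TVM with the PyTorch frontend, uses the input name `"input0"`
+   (a tutorial convention, not a requirement), and compiles for the CPU (`target = "llvm"`);
+   swap in `"cuda"` for an NVIDIA GPU. Exact APIs may differ across TVM versions.
+
+   ```python
+   import torch
+   import torchvision
+   import tvm
+   from tvm import relay
+   from tvm.contrib import graph_executor
+
+   # Step 1: ResNet50 from torchvision (random weights are fine -- no dataset needed).
+   model = torchvision.models.resnet50().eval()
+
+   # TVM's PyTorch frontend consumes a TorchScript trace, so trace with a dummy input.
+   dummy_input = torch.randn(1, 3, 224, 224)
+   traced = torch.jit.trace(model, dummy_input)
+
+   # Step 2: import into Relay, then compile for a target ("llvm" = CPU, "cuda" = GPU).
+   mod, params = relay.frontend.from_pytorch(traced, [("input0", (1, 3, 224, 224))])
+   target = "llvm"
+   with tvm.transform.PassContext(opt_level=3):
+       lib = relay.build(mod, target=target, params=params)
+
+   # Wrap the compiled library in a graph executor and call it from Python.
+   dev = tvm.device(target, 0)
+   module = graph_executor.GraphModule(lib["default"](dev))
+   module.set_input("input0", tvm.nd.array(dummy_input.numpy()))
+   module.run()
+   out = module.get_output(0).numpy()  # (1, 1000) logits
+   ```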
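+
+   Next, a sketch of step 3, continuing from above. `relay.quantize.qconfig` is a
+   context manager holding the quantizer's knobs; `calibrate_mode="global_scale"`
+   avoids needing a calibration dataset (the scale value here is an arbitrary
+   assumption, not a tuned choice).
+
+   ```python
+   # Quantize the Relay module *before* building it.
+   with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
+       qmod = relay.quantize.quantize(mod, params=params)
+
+   # The quantized module compiles the same way as the float one.
+   with tvm.transform.PassContext(opt_level=3):
+       qlib = relay.build(qmod, target=target)
+   qmodule = graph_executor.GraphModule(qlib["default"](dev))
+   ```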
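+
+   Finally, a sketch of steps 4-5. Printing the quantized Relay IR is one way to see
+   whether the datatypes changed; `GraphModule.benchmark` (available in recent TVM
+   releases) is one of TVM's timing utilities.
+
+   ```python
+   # Step 4: inspect the quantized IR -- look for int8/int32 where float32 used to be.
+   print(qmod)
+
+   # Step 5: time the un-quantized vs. the quantized module on the same device.
+   qmodule.set_input("input0", tvm.nd.array(dummy_input.numpy()))
+   print("float32:", module.benchmark(dev, number=10, repeat=3))
+   print("int8:   ", qmodule.benchmark(dev, number=10, repeat=3))
+   ```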
 
-4. [Bonus] If you used `qconfig` for the previous part, look into how to change the quantization precision,
-   which is the number of bits of quantization (the $n$ in int-$n$),
-   by looking at the source code of `class QConfig` or search on forum.
+6. In your quantization setup, how did TVM know that you wanted to quantize to int8?
+   Look into that, and vary the number of bits of quantization (the $n$ in int-$n$).
+   Searching the forum and peeking at the source code of the quantizer class will both help.
 
-   Go down `int8` -> `int4` -> `int2` -> `int1 (bool)`, then followed by non-power-of-2 bits (`int7`, `int6`...),
-   and investigate what is supported by TVM and what is failing when it doesn't work.
+   Try out `int8` -> `int4` -> `int2` -> `int1`, and note which precisions work (see the sketch below).
+   When a precision doesn't work, note exactly which part fails.
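+
+   A sketch for step 6, reusing `mod`, `params`, and `target` from the earlier sketches.
+   The kwargs (`nbit_input`, `nbit_weight`, `dtype_input`, `dtype_weight`) are taken
+   from the quantizer's config class; which `int`-$n$ settings actually build is
+   exactly what this step asks you to find out.
+
+   ```python
+   for nbit in (8, 4, 2, 1):
+       try:
+           with relay.quantize.qconfig(
+               nbit_input=nbit,
+               nbit_weight=nbit,
+               dtype_input=f"int{nbit}",
+               dtype_weight=f"int{nbit}",
+               calibrate_mode="global_scale",
+               global_scale=8.0,
+           ):
+               qmod_n = relay.quantize.quantize(mod, params=params)
+           with tvm.transform.PassContext(opt_level=3):
+               relay.build(qmod_n, target=target)
+           print(f"int{nbit}: built OK")
+       except Exception as err:  # note exactly which stage fails, and why
+           print(f"int{nbit}: failed -- {err}")
+   ```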