.gitmodules · sym-ansor · Yifan Zhao / tvm-fork

[BYOC] CUTLASS integration (#9261) · 541f9f2d

Leyuan Wang authored 3 years ago


* byoc cutlass

* add cmake and fix build

* test worked but accuracy is bad

* fixed argument printing properly

* moving files

* moving contents of cutlass_profiler into python/tvm/contrib/cutlass

* run black

* remove irrelevant codegen code

* clang format

* tried replacing sm 75 with 80, didn't help improve accuracy

* remove irrelevant code from generator

* tried dense + bias fusion but generated cu file does not compile

* dense + bias worked after adding Leyuan's patch, bias + relu worked too

* tried adding sm80 generator but accuracy is still off

* remove GemmUniversal generator

* cleanup partition and build

* moved partition, profile and build function out of test

* turned out the result matches TVM's non-cutlass result. Numpy fp16
matmul is busted?

* clean up test

* LinearCombination can be reused for bias only epilogue

* remove unsupported epilogues like gelu

* removing dead code

* unify gemm templates with and without beta scaling

* supported gelu but accuracy is slightly off

* gelu test passed with relaxed rtol

* cleanup

* remove unused stuff from library.py

* move profiler template into its own file

* removed gemm_profiler.py

* move contents of compile_engine.py into gen_gemm.py

* rename to profiler_template.cu to avoid CI issue

* cleaning up trying to pass pylint

* add missing asf header

* run black

* fixing many pylint issues except wildcard import

* fixed wildcard warning

* add missing CUTLASS.cmake file, restore gemm_profiler.py

* pylint

* minor fix

* add license

* start filling in TODO doc

* rename GemmProfiler to GemmProfilerEmitter

* more renaming and doc

* add doc to the main compile API

* refactored generator

* run black

* black fix

* finish doc TODO

* add test for 32 bit accum

* fixed kernel generator to correctly handle fp32 accum

* revise build-related API

* add option to profile only one kernel

* add option to enable parallel compilation

* clean up gen_gemm

* doc update

* profile_cutlass_kernels -> tune_cutlass_kernels

Co-authored-by: leyuan.wang <leyuan.wang@bytedance.com>
Co-authored-by: Masahiro Masuda <masahi129@gmail.com>
