  1. Aug 14, 2014
    • Make dev/mima runnable on Mac OS X. · fa5a08e6
      Reynold Xin authored
      Mac OS X's find is the BSD variant, which doesn't have the -printf option.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1953 from rxin/mima and squashes the following commits:
      
      e284afe [Reynold Xin] Make dev/mima runnable on Mac OS X.
      fa5a08e6
    • SPARK-3009: Reverted readObject method in ApplicationInfo so that Applic... · a75bc7a2
      Jacek Lewandowski authored
      ...ationInfo is initialized properly after deserialization
      
      Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
      
      Closes #1947 from jacek-lewandowski/master and squashes the following commits:
      
      713b2f1 [Jacek Lewandowski] SPARK-3009: Reverted readObject method in ApplicationInfo so that ApplicationInfo is initialized properly after deserialization
      a75bc7a2
    • Revert [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile · a7f8a4f5
      Michael Armbrust authored
      Reverts #1924 due to build failures with hadoop 0.23.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1949 from marmbrus/revert1924 and squashes the following commits:
      
      6bff940 [Michael Armbrust] Revert "[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile"
      a7f8a4f5
    • [SPARK-2979][MLlib] Improve the convergence rate by minimizing the condition number · 96221067
      DB Tsai authored
      In theory, the scale of your inputs is irrelevant to logistic regression.
      You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
      adjust accordingly: it will be 1E-6 times smaller than the original β1,
      due to the invariance property of MLEs.
      
      However, during the optimization process, the convergence (rate)
      depends on the condition number of the training dataset. Scaling
      the variables often reduces this condition number, thus improving
      the convergence rate.
      
      Without reducing the condition number, some training datasets that mix
      columns of very different scales may fail to converge at all.
      
      GLMNET and LIBSVM packages perform the scaling to reduce
      the condition number, and return the weights in the original scale.
      See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
      
      Here, if useFeatureScaling is enabled, we standardize the training
      features by dividing each column by its standard deviation (without
      subtracting the mean, so sparse vectors stay sparse), and train the
      model in the scaled space. Then we transform the coefficients from the
      scaled space back to the original scale, as GLMNET and LIBSVM do.
      
      Currently, it's only enabled in LogisticRegressionWithLBFGS.
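A rough illustration of this scheme in plain Python (a sketch, not MLlib's actual code; all names here are made up): divide each column by its standard deviation without centering, train in the scaled space, then divide the learned weights by the same standard deviations to recover coefficients in the original scale.

```python
import math

def column_stds(rows):
    """Per-column population standard deviations of a dense dataset."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(d)
    ]

def scale_rows(rows, stds):
    # Divide by std only; the mean is NOT subtracted, so a sparse
    # vector would stay sparse under this transformation.
    return [[x / s for x, s in zip(r, stds)] for r in rows]

def unscale_weights(w_scaled, stds):
    # By MLE invariance, beta_original = beta_scaled / std.
    return [w / s for w, s in zip(w_scaled, stds)]
```

Training on `scale_rows(...)` and then applying `unscale_weights(...)` returns weights in the original scale, which is the GLMNET/LIBSVM behavior the commit describes.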
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
      
      f19fc02 [DB Tsai] Added more comments
      1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS
      96221067
    • Minor cleanup of metrics.Source · eaeb0f76
      Reynold Xin authored
      - Added override.
      - Marked some variables as private.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1943 from rxin/metricsSource and squashes the following commits:
      
      fbfa943 [Reynold Xin] Minor cleanup of metrics.Source. - Added override. - Marked some variables as private.
      eaeb0f76
    • [SPARK-2925] [sql]fix spark-sql and start-thriftserver shell bugs when set --driver-java-options · 267fdffe
      wangfei authored
      https://issues.apache.org/jira/browse/SPARK-2925
      
      Run cmd like this will get the error
      bin/spark-sql --driver-java-options '-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,address=8788,server=y,suspend=y'
      
      Error: Unrecognized option '-Xnoagent'.
      Run with --help for usage help or --verbose for debug output
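The fix quotes the shell variables. As a language-neutral illustration of the bug (in Python for concreteness), an unquoted shell expansion re-splits the option string on whitespace, so a single --driver-java-options value arrives as several arguments:

```python
opts = "-Xdebug -Xnoagent"

# What an unquoted $OPTS expansion effectively produces: word splitting
# turns one option value into multiple arguments.
unquoted = opts.split()

# What a quoted "$OPTS" expansion produces: a single argument.
quoted = [opts]
```

With the unquoted form, the launcher sees `-Xnoagent` as a stray argument, which is exactly the "Unrecognized option" error above.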
      
      Author: wangfei <wangfei_hello@126.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #1851 from scwf/patch-2 and squashes the following commits:
      
      516554d [wangfei] quote variables to fix this issue
      8bd40f2 [wangfei] quote variables to fix this problem
      e6d79e3 [wangfei] fix start-thriftserver bug when set driver-java-options
      948395d [wangfei] fix spark-sql error when set --driver-java-options
      267fdffe
    • [SQL] Python JsonRDD UTF8 Encoding Fix · fde692b3
      Ahir Reddy authored
      Only encode unicode objects to UTF-8, and not strings
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:
      
      ca4e9ba [Ahir Reddy] Encoding Fix
      fde692b3
    • [SPARK-2927][SQL] Add a conf to configure if we always read Binary columns... · add75d48
      Yin Huai authored
      [SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns
      
      This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2927
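A minimal sketch of the decision the new flag controls (illustrative names only, not Spark's schema-conversion code): when no Parquet metadata file supplies a schema, a binary column is mapped to a string column if and only if the flag is on.

```python
def resolved_type(parquet_type, binary_as_string):
    """Map a raw Parquet type to the type Spark SQL would report."""
    if parquet_type == "binary" and binary_as_string:
        return "string"
    return parquet_type
```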
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:
      
      689ffa9 [Yin Huai] Add missing "=".
      80827de [Yin Huai] Unit test.
      1765ca4 [Yin Huai] Use .toBoolean.
      9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
      5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
      add75d48
    • [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile · 078f3fbd
      Chia-Yung Su authored
      Author: Chia-Yung Su <chiayung@appier.com>
      
      Closes #1924 from joesu/bugfix-spark3011 and squashes the following commits:
      
      c7e44f2 [Chia-Yung Su] match syntax
      f8fc32a [Chia-Yung Su] filter out tmp dir
      078f3fbd
    • SPARK-2893: Do not swallow Exceptions when running a custom kryo registrator · 6b8de0e3
      Graham Dennis authored
      The previous behaviour of swallowing ClassNotFound exceptions when running a custom Kryo registrator could lead to difficult-to-debug problems later on at serialisation / deserialisation time; see SPARK-2878. Instead, it is better to fail fast.
      
      Added test case.
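The fail-fast idea can be sketched in a language-neutral way (illustrative names, not Spark's actual API): re-raise the lookup failure with context instead of silently continuing without the registrator.

```python
def load_registrator(class_name, available):
    """Look up a registrator by name; fail fast with context if missing."""
    try:
        return available[class_name]
    except KeyError as exc:
        # Surfacing the failure here, at startup, beats a confusing
        # serialisation error much later.
        raise RuntimeError(
            "Failed to load spark.kryo.registrator " + repr(class_name)
        ) from exc
```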
      
      Author: Graham Dennis <graham.dennis@gmail.com>
      
      Closes #1827 from GrahamDennis/feature/spark-2893 and squashes the following commits:
      
      fbe4cb6 [Graham Dennis] [SPARK-2878]: Update the test case to match the updated exception message
      65e53c5 [Graham Dennis] [SPARK-2893]: Improve message when a spark.kryo.registrator fails.
      f480d85 [Graham Dennis] [SPARK-2893] Fix typo.
      b59d2c2 [Graham Dennis] SPARK-2893: Do not swallow Exceptions when running a custom spark.kryo.registrator
      6b8de0e3
    • [SPARK-3029] Disable local execution of Spark jobs by default · d069c5d9
      Aaron Davidson authored
      Currently, local execution of Spark jobs is only used by take(), and it can be problematic because it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is simply large and we apply a filter with high selectivity or computational overhead.
      
      Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track and understand.
      
      This PR adds a flag for local execution, which is now turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1321 from aarondav/allowlocal and squashes the following commits:
      
      136b253 [Aaron Davidson] Fix DAGSchedulerSuite
      5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
      d069c5d9
    • [SPARK-2995][MLLIB] add ALS.setIntermediateRDDStorageLevel · 69a57a18
      Xiangrui Meng authored
      As mentioned in SPARK-2465, using `MEMORY_AND_DISK_SER` for user/product in/out links together with `spark.rdd.compress=true` can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets.
      
      Another option for the method name is `setIntermediateRDDStorageLevel`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1913 from mengxr/als-storagelevel and squashes the following commits:
      
      d942017 [Xiangrui Meng] rename to setIntermediateRDDStorageLevel
      7550029 [Xiangrui Meng] add ALS.setIntermediateDataStorageLevel
      69a57a18
    • [Docs] Add missing <code> tags (minor) · e4245656
      Andrew Or authored
      These configs looked inconsistent with the rest.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1936 from andrewor14/docs-code and squashes the following commits:
      
      15f578a [Andrew Or] Add <code> tag
      e4245656
    • [SPARK-3006] Failed to execute spark-shell in Windows OS · 9497b12d
      Masayoshi TSUZUKI authored
      Modified the order of the options and arguments in spark-shell.cmd
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #1918 from tsudukim/feature/SPARK-3006 and squashes the following commits:
      
      8bba494 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
      1a32410 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
      9497b12d
  2. Aug 13, 2014
    • SPARK-3020: Print completed indices rather than tasks in web UI · 0c7b4529
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1933 from pwendell/speculation and squashes the following commits:
      
      33a3473 [Patrick Wendell] Use OpenHashSet
      8ce2ff0 [Patrick Wendell] SPARK-3020: Print completed indices rather than tasks in web UI
      0c7b4529
    • [SPARK-2986] [SQL] fixed: setting properties does not effect · 63d67777
      guowei authored
      It seems the SET command is not run by SparkSQLDriver; it runs through the Hive API,
      so users cannot change the number of reducers by setting spark.sql.shuffle.partitions.
      
      Handling Hive property settings, though, seems like a job for Spark SQL itself.
      
      Author: guowei <guowei@upyoo.com>
      
      Closes #1904 from guowei2/temp-branch and squashes the following commits:
      
      7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective
      63d67777
    • [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled · 905dc4b4
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1891 from sarutak/SPARK-2970 and squashes the following commits:
      
      4a2d2fe [Kousuke Saruta] Modified comment style
      8bd833c [Kousuke Saruta] Modified style
      6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem
      905dc4b4
    • [SPARK-2935][SQL]Fix parquet predicate push down bug · 9fde1ff5
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1863 from marmbrus/parquetPredicates and squashes the following commits:
      
      10ad202 [Michael Armbrust] left <=> right
      f249158 [Michael Armbrust] quiet parquet tests.
      802da5b [Michael Armbrust] Add test case.
      eab2eda [Michael Armbrust] Fix parquet predicate push down bug
      9fde1ff5
    • [SPARK-2650][SQL] More precise initial buffer size estimation for in-memory column buffer · 376a82e1
      Cheng Lian authored
      This is a follow up of #1880.
      
      Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.
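The estimation itself is simple; a hypothetical sketch (names are illustrative, not the actual Spark SQL code): since the per-batch row count is known, the initial buffer can be sized from it rather than from a fixed default.

```python
def initial_buffer_size(row_count, column_default_size_bytes):
    """Estimate the initial column-buffer size for one batch."""
    # row count is known per batch, so this is much closer to the
    # final size than a fixed initial capacity would be.
    return row_count * column_default_size_bytes
```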
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:
      
      d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
      376a82e1
    • [SPARK-2994][SQL] Support for udfs that take complex types · 9256d4a9
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1915 from marmbrus/arrayUDF and squashes the following commits:
      
      a1c503d [Michael Armbrust] Support for udfs that take complex types
      9256d4a9
    • [SPARK-2817] [SQL] add "show create table" support · 13f54e2b
      tianyi authored
      In the Spark SQL component, the "show create table" syntax had been disabled.
      We think it is a useful function for describing a Hive table.
      
      Author: tianyi <tianyi@asiainfo-linkage.com>
      Author: tianyi <tianyi@asiainfo.com>
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #1760 from tianyi/spark-2817 and squashes the following commits:
      
      7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
      cbffe8b [tianyi] [SPARK-2817] fix the case problem
      565ec14 [tianyi] [SPARK-2817] fix the case problem
      60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
      dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
      9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
      9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
      bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
      bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
      bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
      bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
      a337bd6 [tianyi] [SPARK-2817] add "show create table" support
      a03db77 [tianyi] [SPARK-2817] add "show create table" support
      13f54e2b
    • [SPARK-3004][SQL] Added null checking when retrieving row set · bdc7a1a4
      Cheng Lian authored
      JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004)
      
      HiveThriftServer2 throws an exception when the result set contains `NULL`. We should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`.
      
      Note that simply using `row.addColumnValue(null)` doesn't work, since Hive sets the column type of a null `ColumnValue` to String by default.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1920 from liancheng/spark-3004 and squashes the following commits:
      
      1b1db1c [Cheng Lian] Adding NULL column values in the Hive way
      2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set
      bdc7a1a4
    • [MLLIB] use Iterator.fill instead of Array.fill · 7ecb867c
      Xiangrui Meng authored
      Iterator.fill uses less memory
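The memory argument has a direct Python analogue (illustrative only): a generator, like Scala's Iterator.fill, yields values lazily one at a time, whereas Array.fill materializes the whole collection at once.

```python
def iter_fill(n, f):
    """Lazily yield f() n times, like Scala's Iterator.fill(n)(f)."""
    return (f() for _ in range(n))
```

Consumers that only need one pass (sums, sampling) never hold all n values in memory.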
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1930 from mengxr/rand-gen-iter and squashes the following commits:
      
      24178ca [Xiangrui Meng] use Iterator.fill instead of Array.fill
      7ecb867c
    • [SPARK-2983] [PySpark] improve performance of sortByKey() · 434bea1c
      Davies Liu authored
      1. skip partitionBy() when numOfPartition is 1
      2. use bisect_left (O(lg(N))) instead of a loop (O(N)) in
      rangePartitioner
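The second point can be sketched as follows (a simplified illustration of the range-partitioner lookup, not PySpark's exact code): bisect_left locates the destination partition among sorted range bounds in O(log n) instead of scanning them linearly.

```python
from bisect import bisect_left

def partition_for(key, bounds):
    """Index of the partition for `key`, given sorted upper bounds
    separating adjacent partitions."""
    return bisect_left(bounds, key)
```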
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1898 from davies/sort and squashes the following commits:
      
      0a9608b [Davies Liu] Merge branch 'master' into sort
      1cf9565 [Davies Liu] improve performance of sortByKey()
      434bea1c
    • [SPARK-3013] [SQL] [PySpark] convert array into list · c974a716
      Davies Liu authored
      because Pyrolite does not support the array type from Python 2.6
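A minimal sketch of the workaround (illustrative, not the actual PySpark code): convert array.array values to plain lists before they reach Pyrolite's pickler.

```python
from array import array

def normalize(obj):
    """Turn array.array values into plain lists; pass anything else through."""
    return list(obj) if isinstance(obj, array) else obj
```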
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1928 from davies/fix_array and squashes the following commits:
      
      858e6c5 [Davies Liu] convert array into list
      c974a716
    • [SPARK-2963] [SQL] There no documentation about building to use HiveServer and CLI for SparkSQL · 869f06c7
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1885 from sarutak/SPARK-2963 and squashes the following commits:
      
      ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
      07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
      6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
      c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
      869f06c7
    • [SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics · fe473595
      Doris Xin authored
      For both Scala and Python.
      
      The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside of `MultivariateStatisticalSummarySerialized`, which is then referenced inside of a method in `PythonMLLibAPI`.
      
      `MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1911 from dorx/colStats and squashes the following commits:
      
      77b9924 [Doris Xin] developerAPI tag
      de9cbbe [Doris Xin] reviewer comments and moved more ser/de
      459faba [Doris Xin] colStats in Statistics for both Scala and Python
      fe473595
    • [SPARK-1777 (partial)] bugfix: make size of requested memory correctly · 2bd81263
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #1892 from liyezhang556520/lazy_memory_request and squashes the following commits:
      
      335ab61 [Zhang, Liye] [SPARK-1777 (partial)] bugfix: make size of requested memory correctly
      2bd81263
    • Use transferTo when copy merge files in ExternalSorter · 246cb3f1
      Raymond Liu authored
      Since this is a file to file copy, using transferTo should be faster.
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1884 from colorant/externalSorter and squashes the following commits:
      
      6e42f3c [Raymond Liu] More code into copyStream
      bfb496b [Raymond Liu] Use transferTo when copy merge files in ExternalSorter
      246cb3f1
    • [SPARK-2953] Allow using short names for io compression codecs · 676f9828
      Reynold Xin authored
      Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
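A hypothetical sketch of the short-name lookup (the LZF and Snappy class names below are assumed for illustration; only the LZ4 one appears above): map a short name to a fully qualified codec class, falling back to the given value so full class names keep working.

```python
SHORT_CODEC_NAMES = {
    "lz4": "org.apache.spark.io.LZ4CompressionCodec",
    "lzf": "org.apache.spark.io.LZFCompressionCodec",
    "snappy": "org.apache.spark.io.SnappyCompressionCodec",
}

def codec_class(name):
    """Resolve a codec config value: short name or full class name."""
    return SHORT_CODEC_NAMES.get(name.lower(), name)
```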
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:
      
      9f50962 [Reynold Xin] Specify short-form compression codec names first.
      63f78ee [Reynold Xin] Updated configuration documentation.
      47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
      676f9828
  3. Aug 12, 2014
    • SPARK-2830 [MLlib]: re-organize mllib documentation · c235b83e
      Ameet Talwalkar authored
      As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
      
      Author: Ameet Talwalkar <atalwalkar@gmail.com>
      
      Closes #1908 from atalwalkar/master and squashes the following commits:
      
      fe6938a [Ameet Talwalkar] made xiangruis suggested changes
      840028b [Ameet Talwalkar] made xiangruis suggested changes
      7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
      c235b83e
    • fix flaky tests · 882da57a
      Davies Liu authored
      Python 2.6 does not handle floating-point errors as well as 2.7+ does.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1910 from davies/fix_test and squashes the following commits:
      
      7e51200 [Davies Liu] fix flaky tests
      882da57a
    • [MLlib] Correctly set vectorSize and alpha · f0060b75
      Liquan Pei authored
      mengxr
      Correctly set vectorSize and alpha in Word2Vec training.
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #1900 from Ishiihara/Word2Vec-bugfix and squashes the following commits:
      
      85f64f2 [Liquan Pei] correctly set vectorSize and alpha
      f0060b75
    • [SPARK-2923][MLLIB] Implement some basic BLAS routines · 9038d94e
      Xiangrui Meng authored
      Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve some performance.
      
      Tested on my local machine:
      
      ~~~
      bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
      examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
      --regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
      ~~~
      
      1. before: ~1m
      2. after: ~30s
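The routines in question are of the classic BLAS level-1 kind; a plain-Python sketch of two of them (signatures are illustrative, not MLlib's actual API):

```python
def axpy(a, x, y):
    """y += a * x, element-wise, in place (BLAS axpy)."""
    for i in range(len(y)):
        y[i] += a * x[i]

def dot(x, y):
    """Inner product of two vectors (BLAS dot)."""
    return sum(xi * yi for xi, yi in zip(x, y))
```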
      
      CC: jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1849 from mengxr/ml-blas and squashes the following commits:
      
      ba583a2 [Xiangrui Meng] exclude Vector.copy
      a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas
      6edeab9 [Xiangrui Meng] address comments
      940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS
      c2a38bc [Xiangrui Meng] enhance dot tests
      4cfaac4 [Xiangrui Meng] add apache header
      48d01d2 [Xiangrui Meng] add tests for zeros and copy
      3b882b1 [Xiangrui Meng] use blas.scal in gradient
      735eb23 [Xiangrui Meng] remove d from BLAS routines
      d2d7d3c [Xiangrui Meng] update gradient and lbfgs
      7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS
      14e6645 [Xiangrui Meng] add ddot
      cbb8273 [Xiangrui Meng] add daxpy test
      07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas
      e8c326d [Xiangrui Meng] axpy
      9038d94e
  4. Aug 11, 2014
    • [SQL] [SPARK-2826] Reduce the memory copy while building the hashmap for HashOuterJoin · 5d54d71d
      Cheng Hao authored
      This is a follow-up for #1147; this PR improves performance by about 10% to 15% in my local tests.
      ```
      Before:
      LeftOuterJoin: took 16750 ms ([3000000] records)
      LeftOuterJoin: took 15179 ms ([3000000] records)
      RightOuterJoin: took 15515 ms ([3000000] records)
      RightOuterJoin: took 15276 ms ([3000000] records)
      FullOuterJoin: took 19150 ms ([6000000] records)
      FullOuterJoin: took 18935 ms ([6000000] records)
      
      After:
      LeftOuterJoin: took 15218 ms ([3000000] records)
      LeftOuterJoin: took 13503 ms ([3000000] records)
      RightOuterJoin: took 13663 ms ([3000000] records)
      RightOuterJoin: took 14025 ms ([3000000] records)
      FullOuterJoin: took 16624 ms ([6000000] records)
      FullOuterJoin: took 16578 ms ([6000000] records)
      ```
      
      Besides the performance improvement, I also do some clean up as suggested in #1147
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits:
      
      ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap
      5d54d71d
    • [SPARK-2650][SQL] Build column buffers in smaller batches · bad21ed0
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1880 from marmbrus/columnBatches and squashes the following commits:
      
      0649987 [Michael Armbrust] add test
      4756fad [Michael Armbrust] fix compilation
      2314532 [Michael Armbrust] Build column buffers in smaller batches
      bad21ed0
    • [SPARK-2968][SQL] Fix nullabilities of Explode. · c686b7dd
      Takuya UESHIN authored
      Output nullabilities of `Explode` can be determined by `ArrayType.containsNull` or `MapType.valueContainsNull`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1888 from ueshin/issues/SPARK-2968 and squashes the following commits:
      
      d128c95 [Takuya UESHIN] Fix nullability of Explode.
      c686b7dd
    • [SPARK-2965][SQL] Fix HashOuterJoin output nullabilities. · c9c89c31
      Takuya UESHIN authored
      Output attributes of opposite side of `OuterJoin` should be nullable.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1887 from ueshin/issues/SPARK-2965 and squashes the following commits:
      
      bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.
      c9c89c31
    • [SQL] A tiny refactoring in HiveContext#analyze · 647aeba3
      Yin Huai authored
      I should use `EliminateAnalysisOperators` in  `analyze` instead of manually pattern matching.
      
      Author: Yin Huai <huaiyin.thu@gmail.com>
      
      Closes #1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits:
      
      f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.
      647aeba3
    • [sql]use SparkSQLEnv.stop() in ShutdownHook · e83fdcd4
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #1852 from scwf/patch-3 and squashes the following commits:
      
      ae28c29 [wangfei] use SparkSQLEnv.stop() in ShutdownHook
      e83fdcd4