  1. Oct 18, 2014
    • [SPARK-3952] [Streaming] [PySpark] add Python examples in Streaming Programming Guide · 05db2da7
      Davies Liu authored
      This adds Python examples to the Streaming Programming Guide.
      
      It also adds a RecoverableNetworkWordCount example.
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2808 from davies/pyguide and squashes the following commits:
      
      8d4bec4 [Davies Liu] update readme
      26a7e37 [Davies Liu] fix format
      3821c4d [Davies Liu] address comments, add missing file
      7e4bb8a [Davies Liu] add Python examples in Streaming Programming Guide
      05db2da7
    • SPARK-3926 [CORE] Result of JavaRDD.collectAsMap() is not Serializable · f406a839
      Sean Owen authored
      Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are
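
      A minimal sketch of the general idea (the helper name is hypothetical, not necessarily what the patch uses): copy the Scala map into a `java.util.HashMap`, which is Serializable, instead of returning a non-serializable wrapper view.

        import java.{util => ju}

        // Hypothetical helper: materialize a Scala Map into a Serializable java.util.HashMap.
        def toSerializableJavaMap[K, V](m: scala.collection.Map[K, V]): ju.Map[K, V] = {
          val result = new ju.HashMap[K, V](m.size)
          m.foreach { case (k, v) => result.put(k, v) }
          result
        }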
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2805 from srowen/SPARK-3926 and squashes the following commits:
      
      ecb78ee [Sean Owen] Fix conflict between java.io.Serializable and use of Scala's Serializable
      f4717f9 [Sean Owen] Oops, fix compile problem
      ae1b36f [Sean Owen] Expand to cover Maps returned from other Java API methods as well
      51c26c2 [Sean Owen] Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are
      f406a839
  2. Oct 17, 2014
    • [SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree · 477c6481
      Joseph K. Bradley authored
      SPARK-3934: When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered.
      Fix: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset).
      
      Added a test to RandomForestSuite that failed with the old version but now passes.
      
      SPARK-3918: Added baggedInput.unpersist at end of training.
      
      Also:
      * I removed DTStatsAggregator.isUnordered since it is no longer used.
      * DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced.
      * Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data.  This is a temporary fix which should eventually be replaced by pre-indexing both datasets.
      * RandomForestModel: Updated toString to print total number of nodes in forest.
      * Changed Predict class to be public DeveloperApi.  This was necessary to allow users to create their own trees by hand (for testing).
      
      CC: mengxr, manishamde, chouqin, codedeft. Just notifying you of these small bug fixes.
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2785 from jkbradley/dtrunner-update and squashes the following commits:
      
      9132321 [Joseph K. Bradley] merged with master, fixed imports
      9dbd000 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      e116473 [Joseph K. Bradley] Changed Predict class to be public DeveloperApi.
      f502e65 [Joseph K. Bradley] bug fix for SPARK-3934
      7f3d60f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      ba567ab [Joseph K. Bradley] Changed DTRunner to load test data using same number of features as in training data.
      4e88c1f [Joseph K. Bradley] changed RF toString to print total number of nodes
      477c6481
    • [SPARK-3985] [Examples] fix file path using os.path.join · 23f6171d
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2834 from adrian-wang/sqlpypath and squashes the following commits:
      
      da7aa95 [Daoyuan Wang] fix file path using path.join
      23f6171d
    • [SPARK-3855][SQL] Preserve the result attribute of python UDFs though transformations · adcb7d33
      Michael Armbrust authored
      In the current implementation it was possible for the reference to change after analysis.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits:
      
      da14879 [Michael Armbrust] Fix test
      6343bcb [Michael Armbrust] add test
      9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations
      adcb7d33
    • [SPARK-3979] [yarn] Use fs's default replication. · 803e7f08
      Marcelo Vanzin authored
      This avoids issues when HDFS is configured in a way that would not
      allow the hardcoded default replication of "3".
      
      Note: getDefaultReplication(Path) was added in 0.23.3, and the oldest
      one available on Maven Central is 0.23.7, so I chose to not add code
      to access that method via reflection.
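
      As a rough illustration of the approach (not the verbatim patch), the idea is to ask the destination FileSystem for its configured default instead of hardcoding 3:

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Illustrative only: use the filesystem's default replication for the destination path.
        def replicationFor(conf: Configuration, dst: Path): Short = {
          val fs = FileSystem.get(dst.toUri, conf)
          fs.getDefaultReplication(dst) // getDefaultReplication(Path) exists since Hadoop 0.23.3
        }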
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2831 from vanzin/SPARK-3979 and squashes the following commits:
      
      b0e3a97 [Marcelo Vanzin] [SPARK-3979] [yarn] Use fs's default replication.
      803e7f08
    • [SPARK-3935][Core] log the number of records that has been written · c3518620
      likun authored
      There is an unused variable (count) in saveAsHadoopDataset in PairRDDFunctions.scala. Its apparent purpose was to count records, so this change adds a log statement reporting the number of records that have been written to the writer.
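
      A hedged sketch of what such counting and logging could look like (names here are illustrative, not the exact code in PairRDDFunctions):

        // Illustrative: count how many records the writer receives, then log the total.
        def writeAll[K, V](records: Iterator[(K, V)], write: (K, V) => Unit,
                           log: String => Unit): Long = {
          var recordsWritten = 0L
          records.foreach { case (k, v) =>
            write(k, v)
            recordsWritten += 1
          }
          log(s"Wrote $recordsWritten records to the Hadoop writer")
          recordsWritten
        }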
      
      Author: likun <jacky.likun@huawei.com>
      Author: jackylk <jacky.likun@huawei.com>
      
      Closes #2791 from jackylk/SPARK-3935 and squashes the following commits:
      
      a874047 [jackylk] removing the unused variable in PairRddFunctions.scala
      3bf43c7 [likun] log the number of records has been written
      c3518620
  3. Oct 16, 2014
    • [SPARK-3973] Print call site information for broadcasts · e678b9f0
      Shivaram Venkataraman authored
      It's hard to debug which broadcast variables refer to what in a big codebase. Printing call site information helps with debugging.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #2829 from shivaram/spark-broadcast-print and squashes the following commits:
      
      cd6dbdf [Shivaram Venkataraman] Print call site information for broadcasts
      e678b9f0
    • [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes · dedace83
      yantangzhai authored
      JobProgressPage could not show Fair Scheduler Pools section sometimes.
      SparkContext starts the web UI and then calls postEnvironmentUpdate. If JobProgressPage is accessed between the web UI starting and postEnvironmentUpdate, the lazy val isFairScheduler is evaluated as false, and the Fair Scheduler Pools section never displays after that.
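
      A schematic sketch of how a lazy val can lock in a stale value (illustrative, not the actual JobProgressPage code):

        // The flag is computed once, on first access. If the page is rendered before the
        // environment update arrives, `false` is cached and the Pools section never shows,
        // even after the scheduling mode becomes known.
        class PoolsSection(currentSchedulingMode: () => Option[String]) {
          lazy val isFairScheduler: Boolean = currentSchedulingMode().exists(_ == "FAIR")
        }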
      
      Author: yantangzhai <tyz0303@163.com>
      Author: YanTangZhai <hakeemzhai@tencent.com>
      
      Closes #1966 from YanTangZhai/SPARK-3067 and squashes the following commits:
      
      d4323f8 [yantangzhai] update [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
      8a00106 [YanTangZhai] Merge pull request #6 from apache/master
      b6391cc [yantangzhai] revert [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
      d2226cd [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
      cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
      aac7f7b [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
      cdef539 [YanTangZhai] Merge pull request #1 from apache/master
      dedace83
    • [SPARK-3741] Add afterExecute for handleConnectExecutor · 56fd34af
      zsxwing authored
      Sorry. I found that I forgot to add `afterExecute` for `handleConnectExecutor` in #2593.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #2794 from zsxwing/SPARK-3741 and squashes the following commits:
      
      a0bc4dd [zsxwing] Add afterExecute for handleConnectExecutor
      56fd34af
    • [SPARK-3890][Docs]remove redundant spark.executor.memory in doc · e7f4ea8a
      WangTaoTheTonic authored
      Introduced in https://github.com/pwendell/spark/commit/f7e79bc42c1635686c3af01eef147dae92de2529, I'm not sure why we need two spark.executor.memory here.
      
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #2745 from WangTaoTheTonic/redundantconfig and squashes the following commits:
      
      e7564dc [WangTao] too long line
      fdbdb1f [WangTaoTheTonic] trivial workaround
      d06b6e5 [WangTaoTheTonic] remove redundant spark.executor.memory in doc
      e7f4ea8a
    • [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo · 642b246b
      Zhang, Liye authored
      In BlockManagerMasterActor, _remainingMem would be increased by memSize twice in updateBlockInfo when the new storageLevel is invalid and the old storageLevel uses memory. Also, _remainingMem should be increased by the original memory size instead of the new memSize.
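
      A small sketch of the corrected bookkeeping described above (names are illustrative; the real logic lives in updateBlockInfo):

        // When a block stops using memory, credit _remainingMem exactly once, and with the
        // block's original in-memory size rather than the new memSize.
        def updatedRemainingMem(remainingMem: Long, originalMemSize: Long,
                                oldUsesMemory: Boolean, newUsesMemory: Boolean): Long = {
          if (oldUsesMemory && !newUsesMemory) remainingMem + originalMemSize
          else remainingMem
        }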
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #2792 from liyezhang556520/spark-3941-remainMem and squashes the following commits:
      
      3d487cc [Zhang, Liye] make the code concise
      0380a32 [Zhang, Liye] [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo
      642b246b
    • [SQL]typo in HiveFromSpark · be2ec4a9
      Kun Li authored
      Author: Kun Li <jacky.likun@gmail.com>
      
      Closes #2809 from jackylk/patch-1 and squashes the following commits:
      
      46c926b [Kun Li] typo in HiveFromSpark
      be2ec4a9
    • [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval · 7f7b50ed
      Aaron Davidson authored
      Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!
      
      I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
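
      For reference, the relevant settings (values reflect the change described above; defaults may differ between Spark versions):

        import org.apache.spark.SparkConf

        // Keep the failure-detector pause comfortably above the heartbeat interval.
        val conf = new SparkConf()
          .set("spark.akka.heartbeat.interval", "1000") // seconds
          .set("spark.akka.heartbeat.pauses", "6000")   // seconds, raised from 600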
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #2784 from aarondav/fix-timeout and squashes the following commits:
      
      bd1151a [Aaron Davidson] Increase pause, don't decrease interval
      9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
      7f7b50ed
    • SPARK-3874: Provide stable TaskContext API · 2fe0ba95
      Prashant Sharma authored
      This is a small number of clean-up changes on top of #2782. Closes #2782.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2803 from pwendell/pr-2782 and squashes the following commits:
      
      56d5b7a [Patrick Wendell] Minor clean-up
      44089ec [Patrick Wendell] Clean-up the TaskContext API.
      ed551ce [Prashant Sharma] Fixed a typo
      df261d0 [Prashant Sharma] Josh's suggestion
      facf3b1 [Prashant Sharma] Fixed the mima issue.
      7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
      bbd9e05 [Prashant Sharma] adding missed out files to git.
      ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
      2fe0ba95
    • [SQL] Fixes the race condition that may cause test failure · 99e416b6
      Cheng Lian authored
      The removed `Future` was used to end the test case as soon as the Spark SQL CLI process exits. When the process exits prematurely, this mechanism prevents the test case from waiting until timeout. But it also creates a race condition: when `foundAllExpectedAnswers.tryFailure` is called, the last expected output line of the CLI process may not yet have been caught by the main logic of the test code, which fails the test case.
      
      Removing this `Future` doesn't affect correctness.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2823 from liancheng/clean-clisuite and squashes the following commits:
      
      489a97c [Cheng Lian] Fixes the race condition that may cause test failure
      99e416b6
    • [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode · 091d32c5
      Davies Liu authored
      A customized pickler should be registered before unpickling, but in the executor there is no way to register the picklers before running the tasks.
      
      So we need to register the picklers in the tasks themselves: duplicate javaToPython() and pythonToJava() in MLlib and call SerDe.initialize() before pickling or unpickling.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2830 from davies/fix_pickle and squashes the following commits:
      
      0c85fb9 [Davies Liu] revert the privacy change
      6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
      0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
      091d32c5
    • [SPARK-3944][Core] Code re-factored as suggested · 4c589cac
      Shiti authored
      Author: Shiti <ssaxena.ece@gmail.com>
      
      Closes #2810 from Shiti/master and squashes the following commits:
      
      051d82f [Shiti] setting the default value of uri scheme to "file"  where matching "file" or None yields the same result
      4c589cac
    • [Core] Upgrading ScalaStyle version to 0.5 and removing SparkSpaceAfterCommentStartChecker. · 044583a2
      prudhvi authored
      Author: prudhvi <prudhvi953@gmail.com>
      
      Closes #2799 from prudhvije/ScalaStyle/space-after-comment-start and squashes the following commits:
      
      fc263a1 [prudhvi] [Core] Using scalastyle to check the space after comment start
      044583a2
  4. Oct 15, 2014
    • [SPARK-2098] All Spark processes should support spark-defaults.conf, config file · 293a0b5d
      GuoQiang Li authored
      This is another implementation of #1256
      cc andrewor14 vanzin
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #2379 from witgo/SPARK-2098-new and squashes the following commits:
      
      4ef1cbd [GuoQiang Li] review commit
      49ef70e [GuoQiang Li] Refactor getDefaultPropertiesFile
      c45d20c [GuoQiang Li] All Spark processes should support spark-defaults.conf, config file
      293a0b5d
  5. Oct 14, 2014
    • SPARK-1307 [DOCS] Don't use term 'standalone' to refer to a Spark Application · 18ab6bd7
      Sean Owen authored
      HT to Diana, just proposing an implementation of her suggestion, which I rather agreed with. Is there a second/third for the motion?
      
      Refer to "self-contained" rather than "standalone" apps to avoid confusion with standalone deployment mode. And fix placement of reference to this in MLlib docs.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2787 from srowen/SPARK-1307 and squashes the following commits:
      
      b5b82e2 [Sean Owen] Refer to "self-contained" rather than "standalone" apps to avoid confusion with standalone deployment mode. And fix placement of reference to this in MLlib docs.
      18ab6bd7
    • [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows · 66af8e25
      Masayoshi TSUZUKI authored
      Modified the scripts so that they do not pollute environment variables:
      the main logic is moved from `XXX.cmd` into `XXX2.cmd`, and `XXX.cmd` invokes `XXX2.cmd` via the cmd command.
      `pyspark.cmd` and `spark-class.cmd` already use this approach, but `spark-shell.cmd`, `spark-submit.cmd` and `/python/docs/make.bat` do not.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #2797 from tsudukim/feature/SPARK-3943 and squashes the following commits:
      
      b397a7d [Masayoshi TSUZUKI] [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows
      66af8e25
    • [SPARK-3869] ./bin/spark-class miss Java version with _JAVA_OPTIONS set · 7b4f39f6
      cocoatomo authored
      When the _JAVA_OPTIONS environment variable is set, the command "java -version" prefixes its output with a message like "Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8".
      ./bin/spark-class determines the Java version from the first line of the "java -version" output, so it misreads the Java version when _JAVA_OPTIONS is set.
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2725 from cocoatomo/issues/3869-mistake-java-version and squashes the following commits:
      
      f894ebd [cocoatomo] [SPARK-3869] ./bin/spark-class miss Java version with _JAVA_OPTIONS set
      7b4f39f6
    • SPARK-3803 [MLLIB] ArrayIndexOutOfBoundsException found in executing computePrincipalComponents · 56096dba
      Sean Owen authored
      Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
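
      A sketch of the overflow-avoidance idea (not the verbatim MLlib code):

        // Compute n * (n + 1) / 2 by dividing the even factor first, check that the result
        // still fits in an Int (it is used as an array size), and fail with a clear message.
        def triangleSize(n: Int): Int = {
          val size: Long =
            if (n % 2 == 0) (n / 2).toLong * (n + 1)
            else n.toLong * ((n + 1) / 2)
          require(size <= Int.MaxValue,
            s"Cannot allocate a Gramian of $size elements for n = $n columns")
          size.toInt
        }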
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2801 from srowen/SPARK-3803 and squashes the following commits:
      
      b4e6d92 [Sean Owen] Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
      56096dba
    • [SPARK-3944][Core] Using Option[String] where value of String can be null · 24b818b9
      shitis authored
      Author: shitis <ssaxena.ece@gmail.com>
      
      Closes #2795 from Shiti/master and squashes the following commits:
      
      46897d7 [shitis] Using Option Wrapper to convert String with value null to None
      24b818b9
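
      The change above boils down to the standard Option(...) idiom for possibly-null values; a minimal illustration (hypothetical usage, not the patched code):

        import java.net.URI

        // Option(x) yields None when x is null, so a URI without a scheme can fall back
        // to "file" without an explicit null check.
        val uri = new URI("/tmp/data.txt")                 // no scheme
        val scheme: Option[String] = Option(uri.getScheme) // getScheme returns null here
        val resolved: String = scheme.getOrElse("file")    // => "file"
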
    • [SPARK-3946] gitignore in /python includes wrong directory · 7ced88b0
      Masayoshi TSUZUKI authored
      Modified to ignore not the whole docs/ directory but only docs/_build/, which is the output directory of the Sphinx build.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #2796 from tsudukim/feature/SPARK-3946 and squashes the following commits:
      
      2bea6a9 [Masayoshi TSUZUKI] [SPARK-3946] gitignore in /python includes wrong directory
      7ced88b0
    • SPARK-3178 setting SPARK_WORKER_MEMORY to a value without a label (m or g)... · 9b6de6fb
      Bill Bejeck authored
      SPARK-3178  setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
      
      Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
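
      A sketch of the kind of check described (message and method name are illustrative):

        // Reject a non-positive worker memory, which is what an unlabeled value such as
        // "1024" (missing the trailing m or g) can silently degenerate to.
        def checkWorkerMemory(memoryMb: Int): Unit = {
          if (memoryMb <= 0) {
            throw new IllegalStateException(
              s"Worker memory is $memoryMb MB; is the M or G suffix missing on SPARK_WORKER_MEMORY?")
          }
        }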
      
      Author: Bill Bejeck <bbejeck@gmail.com>
      
      Closes #2309 from bbejeck/spark-memory-worker and squashes the following commits:
      
      51cf915 [Bill Bejeck] SPARK-3178 - Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
      9b6de6fb
    • [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode · 186b497c
      Aaron Davidson authored
      The goal of this patch is to fix the swapped arguments in standalone mode, which was caused by  https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153.
      
      More details can be found in the JIRA: [SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)
      
      Tested in Standalone mode, but not in Mesos.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #2779 from aarondav/fix-standalone and squashes the following commits:
      
      725227a [Aaron Davidson] Fix ExecutorRunnerTest
      9d703fe [Aaron Davidson] [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
      186b497c
    • [SPARK-3912][Streaming] Fixed flakyFlumeStreamSuite · 4d26aca7
      Tathagata Das authored
      @harishreedharan @pwendell
      See JIRA for diagnosis of the problem
      https://issues.apache.org/jira/browse/SPARK-3912
      
      The solution was to reimplement it.
      1. Find a free port (by binding and releasing a server socket), and then use that port.
      2. Remove Thread.sleep()s; instead, repeatedly try to create a sender and send data, checking whether the data was sent. Use eventually() to minimize waiting time (see the sketch after this list).
      3. Check whether all the data was received, without caring about batches.
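
      A sketch of the retry pattern from step 2, using ScalaTest's eventually (the trySendData helper is hypothetical):

        import org.scalatest.concurrent.Eventually._
        import org.scalatest.time.SpanSugar._

        // Keep trying to create a sender and push data until it succeeds or the timeout
        // expires, instead of sleeping for a fixed amount of time.
        def waitUntilDataSent(trySendData: () => Boolean): Unit = {
          eventually(timeout(10.seconds), interval(100.milliseconds)) {
            assert(trySendData(), "data has not been sent yet")
          }
        }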
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2773 from tdas/flume-test-fix and squashes the following commits:
      
      93cd7f6 [Tathagata Das] Reimplimented FlumeStreamSuite to be more robust.
      4d26aca7
  6. Oct 13, 2014
    • [SPARK-3809][SQL] Fixes test suites in hive-thriftserver · 9eb49d41
      Cheng Lian authored
      As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On top of that, these test suites were known to be flaky; PR #2214 tried to fix them but failed because of an unknown Jenkins build error. This PR fixes both sets of issues.
      
      In this PR, instead of watching the `start-thriftserver.sh` output, the test code starts a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.
      
      As for the root cause of the mysterious Jenkins build failure, please refer to [this comment](https://github.com/apache/spark/pull/2675#issuecomment-58464189) for details.
      
      ----
      
      (Copied from PR description of #2214)
      
      This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:
      
      1. Although metastore, warehouse directories and the listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test cases is doomed to fail.
      2. We caught any exceptions thrown from a test case and printed diagnosis information, but forgot to re-throw the exception...
      3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code from waiting until timeout.
      
      So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2675 from liancheng/fix-thriftserver-tests and squashes the following commits:
      
      1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
      7805c33 [wangfei]  reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
      af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
      d116405 [wangfei] make sure that log4j level is INFO
      ee92a82 [Cheng Lian] Relaxes timeout
      7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver
      9eb49d41
    • [SQL]Small bug in unresolved.scala · 9d9ca91f
      Liquan Pei authored
      `name` should throw an exception containing the name instead of the exprId.
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #2758 from Ishiihara/SparkSQL-bug and squashes the following commits:
      
      aa36a3b [Liquan Pei] small bug
      9d9ca91f
    • SPARK-3807: SparkSql does not work for tables created using custom serde · e6e37701
      chirag authored
      
      SparkSql crashes on selecting tables using custom serde.
      
      Example:
      ----------------
      
      CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer' WITH SERDEPROPERTIES ('serialization.format'='org.apache.thrift.protocol.TBinaryProtocol', 'serialization.class'='ser_class') STORED AS SEQUENCEFILE;
      
      The following exception is seen on running a query like 'select * from table_name limit 1':
      
      ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
      at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
      at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
      at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
      at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100)
      at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
      at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
      at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
      at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
      at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
      at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
      at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
      at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
      at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
      at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.NullPointerException
      
      Author: chirag <chirag.aggarwal@guavus.com>
      
      Closes #2674 from chiragaggarwal/branch-1.1 and squashes the following commits:
      
      370c31b [chirag] SPARK-3807: Add a test case to validate the fix.
      1f26805 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde (Incorporated Review Comments)
      ba4bc0c [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
      5c73b72 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
      
      (cherry picked from commit 925e22d3)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      e6e37701
    • [SQL] Add type checking debugging functions · 371321ca
      Michael Armbrust authored
      Adds some functions that were very useful when trying to track down the bug from #2656. This change also updates the tree output for query plans to add a `'` prefix to unresolved nodes and a `!` prefix to nodes that refer to non-existent attributes.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2657 from marmbrus/debugging and squashes the following commits:
      
      654b926 [Michael Armbrust] Clean-up, add tests
      763af15 [Michael Armbrust] Add typeChecking debugging functions
      8c69303 [Michael Armbrust] Add inputSet, references to QueryPlan. Improve tree string with a prefix to denote invalid or unresolved nodes.
      fbeab54 [Michael Armbrust] Better toString, factories for AttributeSet.
      371321ca
    • [SPARK-3559][SQL] Remove unnecessary columns from List of needed Column Ids in Hive Conf · e10d71e7
      Venkata Ramana Gollamudi authored
      Author: Venkata Ramana G <ramana.gollamudi@huawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #2713 from gvramana/remove_unnecessary_columns and squashes the following commits:
      
      b7ba768 [Venkata Ramana Gollamudi] Added comment and checkstyle fix
      6a93459 [Venkata Ramana Gollamudi] cloned hiveconf for each TableScanOperators so that only required columns are added
      e10d71e7
    • [SPARK-3771][SQL] AppendingParquetOutputFormat should use reflection to... · 73da9c26
      Takuya UESHIN authored
      [SPARK-3771][SQL] AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility.
      
      Original problem is [SPARK-3764](https://issues.apache.org/jira/browse/SPARK-3764).
      
      `AppendingParquetOutputFormat` uses a binary-incompatible method `context.getTaskAttemptID`.
      This breaks binary compatibility of Spark itself, i.e. if Spark is built against hadoop-1, the artifact works only with hadoop-1, and vice versa.
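
      A rough illustration of the reflection approach (not the verbatim patch):

        import org.apache.hadoop.mapreduce.TaskAttemptContext

        // getTaskAttemptID is declared on a type that is a class in hadoop-1 and an interface
        // in hadoop-2, so calling it directly bakes one of the two signatures into the bytecode.
        // Resolving it reflectively at runtime keeps a single artifact usable with both.
        def taskAttemptId(context: TaskAttemptContext): AnyRef = {
          val method = context.getClass.getMethod("getTaskAttemptID")
          method.invoke(context)
        }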
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #2638 from ueshin/issues/SPARK-3771 and squashes the following commits:
      
      efd3784 [Takuya UESHIN] Add a comment to explain the reason to use reflection.
      ec213c1 [Takuya UESHIN] Use reflection to prevent breaking binary-compatibility.
      73da9c26
    • [SPARK-3529] [SQL] Delete the temp files after test exit · d3cdf912
      Cheng Hao authored
      There are lots of temporary files created by TestHive under /tmp by default, which may cause potential performance issues for testing. This PR automatically deletes them after the tests exit.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2393 from chenghao-intel/delete_temp_on_exit and squashes the following commits:
      
      3a6511f [Cheng Hao] Remove the temp dir after text exit
      d3cdf912
    • [SPARK-2066][SQL] Adds checks for non-aggregate attributes with aggregation · 56102dc2
      Cheng Lian authored
      This PR adds a new rule `CheckAggregation` to the analyzer to provide better error message for non-aggregate attributes with aggregation.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2774 from liancheng/non-aggregate-attr and squashes the following commits:
      
      5246004 [Cheng Lian] Passes test suites
      bf1878d [Cheng Lian] Adds checks for non-aggregate attributes with aggregation
      56102dc2
    • [SPARK-3407][SQL]Add Date type support · 2ac40da3
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2344 from adrian-wang/date and squashes the following commits:
      
      f15074a [Daoyuan Wang] remove outdated lines
      2038085 [Daoyuan Wang] update return type
      00fe81f [Daoyuan Wang] address lian cheng's comments
      0df6ea1 [Daoyuan Wang] rebase and remove simple string
      bb1b1ef [Daoyuan Wang] remove failing test
      aa96735 [Daoyuan Wang] not cast for same type compare
      30bf48b [Daoyuan Wang] resolve rebase conflict
      617d1a8 [Daoyuan Wang] add date_udf case to white list
      c37e848 [Daoyuan Wang] comment update
      5429212 [Daoyuan Wang] change to long
      f8f219f [Daoyuan Wang] revise according to Cheng Hao
      0e0a4f5 [Daoyuan Wang] minor format
      4ddcb92 [Daoyuan Wang] add java api for date
      0e3110e [Daoyuan Wang] try to fix timezone issue
      17fda35 [Daoyuan Wang] set test list
      2dfbb5b [Daoyuan Wang] support date type
      2ac40da3
    • [SPARK-3892][SQL] remove redundant type name · 46db277c
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2747 from adrian-wang/typename and squashes the following commits:
      
      2824216 [Daoyuan Wang] remove redundant typeName
      fbaf340 [Daoyuan Wang] typename
      46db277c
    • [Spark] RDD take() method: overestimate too much · 49bbdcb6
      yingjieMiao authored
      In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
      
      `(1.5 * num * partsScanned / buf.size).toInt` is the guess of the total number of partitions needed. In every iteration, we should only consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`.
      The existing implementation grows `partsScanned` exponentially (roughly `x_{n+1} >= (1.5 + 1) * x_n`).
      
      This could be a performance problem (unless this is the intended behavior).
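
      A sketch of the capped-growth idea (variable names follow the commit message; this is not the verbatim RDD.take code):

        // Estimate the total number of partitions needed (overestimating by 50%), keep only
        // the increment over what was already scanned, and cap the growth so one iteration
        // never jumps to a huge number of partitions.
        def nextPartsToTry(num: Int, partsScanned: Int, bufSize: Int): Int = {
          if (bufSize == 0) {
            partsScanned * 4
          } else {
            val increment = (1.5 * num * partsScanned / bufSize).toInt - partsScanned
            math.min(math.max(increment, 1), partsScanned * 4)
          }
        }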
      
      Author: yingjieMiao <yingjie@42go.com>
      
      Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:
      
      d758218 [yingjieMiao] scala style fix
      a8e74bb [yingjieMiao] python style fix
      4b6e777 [yingjieMiao] infix operator style fix
      4391d3b [yingjieMiao] typo fix.
      692f4e6 [yingjieMiao] cap numPartsToTry
      c4483dc [yingjieMiao] style fix
      1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
      d31ff7e [yingjieMiao] handle the edge case after 1 iteration
      a2aa36b [yingjieMiao] RDD take method: overestimate too much
      49bbdcb6