  1. Apr 13, 2015
    • [SPARK-6207] [YARN] [SQL] Adds delegation tokens for metastore to conf. · 77620be7
      Doug Balog authored
      Adds hive2-metastore delegation token to conf when running in secure mode.
      Without this change, running on YARN in cluster mode fails with a
      GSS exception.
      
      This is a rough patch that adds a dependency to spark/yarn on hive-exec.
      I'm looking for suggestions on how to make this patch better.
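
      Below is a minimal sketch of the reflection-based approach (see commit e260765 in the squash list), assuming Hive 0.13 class and method names; the helper name and the token alias are illustrative, not Spark's exact code.

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.io.Text
      import org.apache.hadoop.security.Credentials
      import org.apache.hadoop.security.token.{Token, TokenIdentifier}

      // Fetch a metastore delegation token reflectively, so spark/yarn needs no
      // compile-time dependency on hive-exec (assumes Hive 0.13 signatures).
      def obtainHiveMetastoreToken(conf: Configuration, user: String, creds: Credentials): Unit = {
        val hiveClass = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive")
        val hiveConfClass = Class.forName("org.apache.hadoop.hive.conf.HiveConf")
        val hiveConf = hiveConfClass
          .getConstructor(classOf[Configuration], classOf[Class[_]])
          .newInstance(conf, hiveConfClass)
        val hive = hiveClass.getMethod("get", hiveConfClass).invoke(null, hiveConf)
        val tokenStr = hiveClass
          .getMethod("getDelegationToken", classOf[String], classOf[String])
          .invoke(hive, user, user)
          .asInstanceOf[String]
        val token = new Token[TokenIdentifier]()
        token.decodeFromUrlString(tokenStr)
        creds.addToken(new Text("hive.metastore.delegation.token"), token) // alias is illustrative
      }
      ```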
      
      This contribution is my original work and I license the work to the
      Apache Spark project under the project's open source license.
      
      Author: Doug Balog <doug.balog@target.com>
      
      Closes #5031 from dougb/SPARK-6207 and squashes the following commits:
      
      3e9ac16 [Doug Balog] [SPARK-6207] Fixes minor code spacing issues.
      e260765 [Doug Balog] [SPARK-6207] Second pass at adding Hive delegation token to conf. - Use reflection instead of adding dependency on hive. - Tested on Hive 0.13 and Hadoop 2.4.1
      1ab1729 [Doug Balog] Merge branch 'master' of git://github.com/apache/spark into SPARK-6207
      bf356d2 [Doug Balog] [SPARK-6207] [YARN] [SQL] Adds delegation tokens for metastore to conf. Adds hive2-metastore delegation token to conf when running in secure mode. Without this change, running on YARN in cluster mode fails with a GSS exception.
      77620be7
    • [SPARK-6352] [SQL] Add DirectParquetOutputCommitter · b29663ee
      Pei-Lun Lee authored
      Add a DirectParquetOutputCommitter class that skips the _temporary directory when saving to S3. Add a new config value "spark.sql.parquet.useDirectParquetOutputCommitter" (default false) to switch between the default output committer and the direct one.
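
      A hedged usage sketch with the config key named above (the SparkContext `sc`, the table name, and the S3 path are assumptions):

      ```scala
      import org.apache.spark.sql.SQLContext

      val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
      // Opt in to the direct committer, which writes files in place and skips
      // the slow _temporary rename step on S3.
      sqlContext.setConf("spark.sql.parquet.useDirectParquetOutputCommitter", "true")
      sqlContext.table("events").saveAsParquetFile("s3n://my-bucket/events-parquet")
      ```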
      
      Author: Pei-Lun Lee <pllee@appier.com>
      
      Closes #5042 from ypcat/spark-6352 and squashes the following commits:
      
      e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
      9ae7545 [Pei-Lun Lee] [SPARK-6352] [SQL] Change to allow custom parquet output committer.
      0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
      c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
      0fc03ca [Pei-Lun Lee] [SPARK-6352] [SQL] hide class DirectParquetOutputCommitter
      769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
      f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
      b29663ee
    • [SPARK-6870][Yarn] Catch InterruptedException when the yarn application state monitor thread has been interrupted · 202ebf06
      linweizhong authored
      [SPARK-6870][Yarn] Catch InterruptedException when the yarn application state monitor thread has been interrupted

      In PR #5305 we interrupt the monitor thread but forgot to catch the InterruptedException; the stack trace then gets printed to the log, so we need to catch it.
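
      A minimal sketch of the shape of the fix (the thread name and loop body are illustrative, not Spark's exact code):

      ```scala
      val monitorThread = new Thread("yarn-application-state-monitor") {
        override def run(): Unit = {
          try {
            while (!isInterrupted) {
              // poll yarnClient.getApplicationReport(appId) ...
              Thread.sleep(1000)
            }
          } catch {
            case _: InterruptedException =>
              // Expected when the monitor is interrupted on shutdown;
              // exit quietly instead of logging a stack trace.
          }
        }
      }
      ```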
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #5479 from Sephiroth-Lin/SPARK-6870 and squashes the following commits:
      
      f775f93 [linweizhong] Update, don't need to call Thread.currentThread() on monitor thread
      0e2ef1f [linweizhong] Update
      0d8958a [linweizhong] Update
      3513fdb [linweizhong] Catch InterruptedException
      202ebf06
    • [SPARK-6671] Add status command for spark daemons · 240ea03f
      Pradeep Chanumolu authored
      SPARK-6671
      Currently, using the spark-daemon.sh script we can start and stop the Spark daemons, but we cannot get their status. It would be nice to include a status command in the spark-daemon.sh script, through which we can know whether a Spark daemon is alive or not.
      
      Author: Pradeep Chanumolu <pchanumolu@maprtech.com>
      
      Closes #5327 from pchanumolu/master and squashes the following commits:
      
      d3a1f05 [Pradeep Chanumolu] Make status command check consistent with Stop command
      5062926 [Pradeep Chanumolu] Fix indentation in spark-daemon.sh
      3e66bc8 [Pradeep Chanumolu] SPARK-6671 : Add status command to spark daemons
      1ac3918 [Pradeep Chanumolu] Add status command to spark-daemon
      240ea03f
    • [SPARK-6440][CORE] Handle IPv6 addresses properly when constructing URI · 9d117cee
      nyaapa authored
      Author: nyaapa <nyaapa@gmail.com>
      
      Closes #5424 from nyaapa/master and squashes the following commits:
      
      6b717aa [nyaapa] [SPARK-6440][CORE] Remove Utils.localIpAddressHostname, Utils.localIpAddressURI and Utils.getAddressHostName; make Utils.localIpAddress private; rename Utils.localHostURI into Utils.localHostNameForURI; use Utils.localHostName in org.apache.spark.streaming.kinesis.KinesisReceiver and org.apache.spark.sql.hive.thriftserver.SparkSQLEnv
      2098081 [nyaapa] [SPARK-6440][CORE] style fixes and use getHostAddress instead of getHostName
      84763d7 [nyaapa] [SPARK-6440][CORE]Handle IPv6 addresses properly when constructing URI
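
      A hedged sketch of the bracketing rule being applied (RFC 2732: IPv6 literals in a URI authority must be wrapped in square brackets); the helper name is illustrative:

      ```scala
      import java.net.{Inet6Address, InetAddress, URI}

      def hostForURI(addr: InetAddress): String = addr match {
        case v6: Inet6Address => "[" + v6.getHostAddress + "]" // e.g. [0:0:0:0:0:0:0:1]
        case v4 => v4.getHostAddress
      }

      // Both address families now yield a parseable URI:
      val addr = InetAddress.getByName("::1")
      val uri = new URI(s"spark://${hostForURI(addr)}:7077")
      ```
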
      9d117cee
    • [SPARK-6860][Streaming][WebUI] Fix the possible inconsistency of StreamingPage · 14ce3ea2
      zsxwing authored
      Because `StreamingPage.render` doesn't hold the `listener` lock when generating the content, the different parts of the content may show inconsistent values if `listener` updates its status at the same time, which can confuse people.
      
      This PR added `listener.synchronized` to make sure we have a consistent view of StreamingJobProgressListener when creating the content.
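
      A sketch of the locking pattern; the listener field names are illustrative, the point being that every value is read under one `listener.synchronized` block:

      ```scala
      // Render from a single consistent snapshot of the listener's state,
      // instead of reading each field under a separate (or no) lock.
      def generateBasicStats(): String = listener.synchronized {
        val numReceivers = listener.numReceivers
        val numCompletedBatches = listener.numTotalCompletedBatches
        s"receivers: $numReceivers, completed batches: $numCompletedBatches"
      }
      ```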
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5470 from zsxwing/SPARK-6860 and squashes the following commits:
      
      cec6f92 [zsxwing] Add missing 'synchronized' in StreamingJobProgressListener
      7182498 [zsxwing] Add synchronized to make sure we have a consistent view of StreamingJobProgressListener when creating the content
      14ce3ea2
    • [SPARK-6762] Fix potential resource leaks in CheckpointWriter and CheckpointReader · cadd7d72
      lisurprise authored
      The close action should be placed within a finally block to avoid potential resource leaks.
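
      A minimal sketch of the pattern (the `tryWithSafeFinally` refactor in the squash list additionally guards against an exception from `close()` masking the original failure):

      ```scala
      import java.io.OutputStream

      def writeCheckpointBytes(out: OutputStream, bytes: Array[Byte]): Unit = {
        try {
          out.write(bytes)
          out.flush()
        } finally {
          out.close() // runs on both success and failure, so the stream never leaks
        }
      }
      ```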
      
      Author: lisurprise <zhichao.li@intel.com>
      
      Closes #5407 from zhichao-li/master and squashes the following commits:
      
      065999f [lisurprise] add guard for null
      ef862d6 [lisurprise] remove fs.close
      a754adc [lisurprise] refactor with tryWithSafeFinally
      824adb3 [lisurprise] close before validation
      c877da7 [lisurprise] Fix potential resource leaks
      cadd7d72
    • [SPARK-6868][YARN] Fix broken container log link on executor page when HTTPS_ONLY. · 950645d5
      Dean Chen authored
      Correct the http scheme in the YARN container log link in the Spark UI when YARN is configured to be HTTPS_ONLY.
      
      Uses the same logic as the YARN jobtracker webapp. Entry point is [JobBlock](https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108) and logic is in [MRWebAppUtil](https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/util/MRWebAppUtil.java#L75).
      
      I chose to migrate the logic over instead of importing MRWebAppUtil (but I can update the PR to do so), since the class is designated as private and the logic is straightforward.
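
      A hedged sketch of the migrated logic, keyed off Hadoop's `yarn.http.policy` setting; the helper name and the link layout in the comment are illustrative:

      ```scala
      import org.apache.hadoop.conf.Configuration

      // Pick the scheme for container log links from the cluster's YARN http
      // policy instead of hard-coding "http://".
      def yarnLogScheme(conf: Configuration): String =
        if (conf.get("yarn.http.policy", "HTTP_ONLY") == "HTTPS_ONLY") "https://" else "http://"

      // e.g. s"${yarnLogScheme(conf)}$nodeHttpAddress/node/containerlogs/$containerId/$user"
      ```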
      
      Author: Dean Chen <deanchen5@gmail.com>
      
      Closes #5477 from deanchen/container-url and squashes the following commits:
      
      91d3090 [Dean Chen] Correct the http scheme in the YARN container log link in the Spark UI when YARN is configured to be HTTPS_ONLY.
      950645d5
    • [SPARK-6562][SQL] DataFrame.replace · 68d1faa3
      Reynold Xin authored
      Supports replacing values with other values in DataFrames.
      
      Python support should be in a separate pull request.
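
      A hedged usage sketch of the new API (the column names and values are made up):

      ```scala
      import sqlContext.implicits._ // sqlContext: an existing SQLContext

      val df = Seq((1, "UNKNOWN"), (2, "alice"), (3, "UNKNOWN")).toDF("id", "name")
      // Replace every "UNKNOWN" in the name column with "unnamed".
      val cleaned = df.na.replace("name", Map("UNKNOWN" -> "unnamed"))
      ```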
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5282 from rxin/df-na-replace and squashes the following commits:
      
      4b72434 [Reynold Xin] Removed println.
      c8d9946 [Reynold Xin] col -> cols
      fbb3c21 [Reynold Xin] [SPARK-6562][SQL] DataFrame.replace
      68d1faa3
    • [SPARK-5885][MLLIB] Add VectorAssembler as a feature transformer · 92940449
      Xiangrui Meng authored
      VectorAssembler merges multiple columns into a vector column. This PR contains content from #5195.
      
      ~~carry ML attributes~~ (moved to a follow-up PR)
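
      A hedged usage sketch (the column names and the input DataFrame `dataset` are assumptions):

      ```scala
      import org.apache.spark.ml.feature.VectorAssembler

      val assembler = new VectorAssembler()
        .setInputCols(Array("age", "income", "clicks")) // numeric/vector input columns
        .setOutputCol("features")                       // merged vector column
      val assembled = assembler.transform(dataset)
      ```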
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5196 from mengxr/SPARK-5885 and squashes the following commits:
      
      a52b101 [Xiangrui Meng] recognize more types
      35daac2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5885
      bb5e64b [Xiangrui Meng] add TODO for null
      976a3d6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5885
      0859311 [Xiangrui Meng] Revert "add CreateStruct"
      29fb6ac [Xiangrui Meng] use CreateStruct
      adb71c4 [Xiangrui Meng] Merge branch 'SPARK-6542' into SPARK-5885
      85f3106 [Xiangrui Meng] add CreateStruct
      4ff16ce [Xiangrui Meng] add VectorAssembler
      92940449
    • [SPARK-5886][ML] Add StringIndexer as a feature transformer · 685ddcf5
      Xiangrui Meng authored
      This PR adds string indexer, which takes a column of string labels and outputs a double column with labels indexed by their frequency.
      
      TODOs:
      - [x] store feature to index map in output metadata
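
      A hedged usage sketch (the column names and the DataFrame `df` are made up):

      ```scala
      import org.apache.spark.ml.feature.StringIndexer

      val indexer = new StringIndexer()
        .setInputCol("category")       // string labels
        .setOutputCol("categoryIndex") // double indices; the most frequent label gets 0.0
      val indexed = indexer.fit(df).transform(df)
      ```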
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4735 from mengxr/SPARK-5886 and squashes the following commits:
      
      d82575f [Xiangrui Meng] fix test
      700e70f [Xiangrui Meng] rename LabelIndexer to StringIndexer
      16a6f8c [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
      457166e [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
      f8b30f4 [Xiangrui Meng] update label indexer to output metadata
      e81ec28 [Xiangrui Meng] Merge branch 'openhashmap-contains' into SPARK-5886-2
      d6e6f1f [Xiangrui Meng] add contains to primitivekeyopenhashmap
      748a69b [Xiangrui Meng] add contains to OpenHashMap
      def3c5c [Xiangrui Meng] add LabelIndexer
      685ddcf5
    • [SPARK-4081] [mllib] VectorIndexer · d3792f54
      Joseph K. Bradley authored
      **Ready for review!**
      
      Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.
      
      This introduces a VectorIndexer class which does the following:
      * VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
        * Features which exceed maxCategories are declared continuous, and the Model will treat them as such.
      * VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices
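
      A hedged usage sketch of the API just described (the DataFrame `data` and column names are assumptions):

      ```scala
      import org.apache.spark.ml.feature.VectorIndexer

      val indexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(10)          // features with > 10 distinct values stay continuous
      val model = indexer.fit(data)    // data: a DataFrame with a vector column
      val indexed = model.transform(data)
      ```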
      
      Design notes:
      * This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
      * This does not yet support transforming data with new (unknown) categorical feature values.  That can be added later.
      * This is necessary for DecisionTree and tree ensembles.
      
      Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.
      
      Other notes:
      * This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3000 from jkbradley/indexer and squashes the following commits:
      
      5956d91 [Joseph K. Bradley] minor cleanups
      f5c57a8 [Joseph K. Bradley] added Java test suite
      643b444 [Joseph K. Bradley] removed FeatureTests
      02236c3 [Joseph K. Bradley] Updated VectorIndexer, ready for PR
      286d221 [Joseph K. Bradley] Reworked DatasetIndexer for spark.ml API, and renamed it to VectorIndexer
      12e6cf2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into indexer
      6d8f3f1 [Joseph K. Bradley] Added partly done DatasetIndexer to spark.ml
      6a2f553 [Joseph K. Bradley] Updated TODO for allowUnknownCategories
      3f041f8 [Joseph K. Bradley] Final cleanups for DatasetIndexer
      038b9e3 [Joseph K. Bradley] DatasetIndexer now maintains sparsity in SparseVector
      3a4a0bd [Joseph K. Bradley] Added another test for DatasetIndexer
      2006923 [Joseph K. Bradley] DatasetIndexer now passes tests
      f409987 [Joseph K. Bradley] partly done with DatasetIndexerSuite
      5e7c874 [Joseph K. Bradley] working on DatasetIndexer
      d3792f54
    • [SPARK-6643][MLLIB] Implement StandardScalerModel missing methods · fc176614
      lewuathe authored
      This is the sub-task of SPARK-6254.
      Wrap missing methods for `StandardScalerModel`.
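
      For reference, a hedged Scala-side sketch of the model members the Python wrapper now exposes (the input RDD `data` is assumed):

      ```scala
      import org.apache.spark.mllib.feature.StandardScaler
      import org.apache.spark.mllib.linalg.Vectors

      val scaler = new StandardScaler(withMean = true, withStd = true)
      val model = scaler.fit(data)  // data: RDD[Vector]
      val mu = model.mean           // per-feature means
      val sigma = model.std         // per-feature standard deviations
      val scaled = model.transform(Vectors.dense(1.0, 2.0, 3.0))
      ```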
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5310 from Lewuathe/SPARK-6643 and squashes the following commits:
      
      fafd690 [lewuathe] Fix for lint-python
      bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
      578f5ee [lewuathe] Remove unnecessary class
      a38f155 [lewuathe] Merge master
      66bb2ab [lewuathe] Fix typos
      82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
      fc176614
  2. Apr 12, 2015
    • [SPARK-6765] Fix test code style for core. · a1fe59da
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5484 from rxin/test-style-core and squashes the following commits:
      
      e0b0100 [Reynold Xin] [SPARK-6765] Fix test code style for core.
      a1fe59da
    • [MINOR] a typo: coalesce · 04bcd67c
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5482 from adrian-wang/typo and squashes the following commits:
      
      e65ef6f [Daoyuan Wang] typo
      04bcd67c
    • [SPARK-6431][Streaming][Kafka] Error message for partition metadata requests · 6ac8eea2
      cody koeninger authored
      
      The originally reported problem was misdiagnosed; the topic just didn't exist yet. The agreed-upon solution was to improve the error handling / messages.
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #5454 from koeninger/spark-6431-master and squashes the following commits:
      
      44300f8 [cody koeninger] [SPARK-6431][Streaming][Kafka] Error message for partition metadata requests
      6ac8eea2
    • [SPARK-6843][core] Add volatile for the "state" · ddc17431
      lisurprise authored
      Fix a potential visibility problem for the "state" field of Executor.

      The "state" field is shared and modified by multiple threads, i.e.:

      ```scala
      // Within ExecutorRunner.scala:

      // (1) the worker thread
      workerThread = new Thread("ExecutorRunner for " + fullId) {
        override def run() { fetchAndRunExecutor() }
      }
      workerThread.start()

      // (2) the shutdown hook that kills actors on shutdown
      shutdownHook = new Thread() {
        override def run() {
          killProcess(Some("Worker shutting down"))
        }
      }

      // (3) and also the "Actor" thread for the worker
      ```

      I think we should at least add volatile to ensure visibility among threads; otherwise the worker might send an out-of-date status to the master.
      
      https://issues.apache.org/jira/browse/SPARK-6843
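
      A sketch of the fix's shape; the enumeration below only mirrors the idea of Spark's ExecutorState:

      ```scala
      object ExecutorState extends Enumeration {
        val LAUNCHING, RUNNING, KILLED, FAILED, LOST, EXITED = Value
      }

      class ExecutorRunnerSketch {
        // @volatile makes a state write from any of the three threads above
        // visible to the others before the next status update goes to the master.
        @volatile var state: ExecutorState.Value = ExecutorState.LAUNCHING
      }
      ```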
      
      Author: lisurprise <zhichao.li@intel.com>
      
      Closes #5448 from zhichao-li/state and squashes the following commits:
      
      a2386e7 [lisurprise] add volatile for state field
      ddc17431
    • [SPARK-6866][Build] Remove duplicated dependency in launcher/pom.xml · e9445b18
      Guancheng (G.C.) Chen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6866
      
      Remove the duplicated dependency on scalatest in launcher/pom.xml, since it is already inherited from the parent pom.xml.
      
      Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
      
      Closes #5476 from gchen/SPARK-6866 and squashes the following commits:
      
      1ab484b [Guancheng (G.C.) Chen] remove duplicated dependency in launcher/pom.xml
      e9445b18
    • [SPARK-6677] [SQL] [PySpark] fix cached classes · 5d8f7b9e
      Davies Liu authored
      It's possible to have two DataType objects with the same id (memory address) at different times, so we should check the cached classes to verify that each was generated by the given datatype.
      
      This PR also change `__FIELDS__` and `__DATATYPE__` to lower case to match Python code style.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5445 from davies/fix_type_cache and squashes the following commits:
      
      63b3238 [Davies Liu] typo
      47bdede [Davies Liu] fix cached classes
      5d8f7b9e
    • MAINTENANCE: Automated closing of pull requests. · 0cc8fcb4
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #4994 (close requested by 'marmbrus')
      Closes #4995 (close requested by 'marmbrus')
      Closes #4491 (close requested by 'srowen')
      Closes #3597 (close requested by 'srowen')
      Closes #4693 (close requested by 'marmbrus')
      Closes #3855 (close requested by 'marmbrus')
      Closes #4398 (close requested by 'marmbrus')
      Closes #4246 (close requested by 'marmbrus')
      Closes #5153 (close requested by 'srowen')
      Closes #3626 (close requested by 'srowen')
      Closes #5166 (close requested by 'marmbrus')
      Closes #5040 (close requested by 'marmbrus')
      Closes #5044 (close requested by 'marmbrus')
      Closes #5440 (close requested by 'JoshRosen')
      Closes #4039 (close requested by 'marmbrus')
      Closes #1237 (close requested by 'srowen')
      Closes #216 (close requested by 'mengxr')
      Closes #5092 (close requested by 'srowen')
      Closes #5100 (close requested by 'marmbrus')
      Closes #4469 (close requested by 'marmbrus')
      Closes #5246 (close requested by 'srowen')
      Closes #5013 (close requested by 'marmbrus')
      0cc8fcb4
  3. Apr 11, 2015
    • SPARK-6710 GraphX Fixed Wrong initial bias in GraphX SVDPlusPlus · 1205f7ea
      Michael Malak authored
      Author: Michael Malak <michaelmalak@yahoo.com>
      
      Closes #5464 from michaelmalak/master and squashes the following commits:
      
      9d942ba [Michael Malak] SPARK-6710 GraphX Fixed Wrong initial bias in GraphX SVDPlusPlus
      1205f7ea
    • dea5dacc
    • [SQL][minor] move `resolveGetField` into an object · 5c2844c5
      Wenchen Fan authored
      The method `resolveGetField` doesn't logically belong to `LogicalPlan` and doesn't access any of its members.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5435 from cloud-fan/tmp and squashes the following commits:
      
      9a66c83 [Wenchen Fan] code clean up
      5c2844c5
    • [SPARK-6367][SQL] Use the proper data type for those expressions that are hijacking existing data types. · 6d4e854f
      Yin Huai authored
      [SPARK-6367][SQL] Use the proper data type for those expressions that are hijacking existing data types.
      
      This PR adds internal UDTs for expressions that are hijacking existing data types.
      The following UDTs are added:
      * `HyperLogLogUDT` (`BinaryType` as the SQL type) for `ApproxCountDistinctPartition`
      * `OpenHashSetUDT` (`ArrayType` as the SQL type) for `CollectHashSet`, `NewSet`, `AddItemToSet`, and `CombineSets`.
      
      I am also adding more unit tests for aggregation with code gen enabled.
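
      As a hedged, simplified illustration of the UDT pattern (the PR's actual classes live in Spark's sql internals; this stand-in assumes the clearspring `HyperLogLog` API):

      ```scala
      import com.clearspring.analytics.stream.cardinality.HyperLogLog
      import org.apache.spark.sql.types._

      // SQL sees BinaryType; in memory the value stays a HyperLogLog, so the
      // aggregate no longer hijacks BinaryType directly.
      class HyperLogLogUDT extends UserDefinedType[HyperLogLog] {
        override def sqlType: DataType = BinaryType
        override def serialize(obj: Any): Array[Byte] =
          obj.asInstanceOf[HyperLogLog].getBytes
        override def deserialize(datum: Any): HyperLogLog =
          HyperLogLog.Builder.build(datum.asInstanceOf[Array[Byte]])
        override def userClass: Class[HyperLogLog] = classOf[HyperLogLog]
      }
      ```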
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-6367
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5094 from yhuai/expressionType and squashes the following commits:
      
      8bcd11a [Yin Huai] Return types.
      61a1d66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into expressionType
      e8b4599 [Yin Huai] Merge remote-tracking branch 'upstream/master' into expressionType
      2753156 [Yin Huai] Ignore aggregations having sum functions for now.
      b5eb259 [Yin Huai] Case object for HyperLogLog type.
      00ebdbd [Yin Huai] deserialize/serialize.
      54b87ae [Yin Huai] Add UDTs for expressions that return HyperLogLog and OpenHashSet.
      6d4e854f
    • [SQL] Handle special characters in the authority of a Path's URI. · d2383fb5
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5381 from yhuai/parquetPath2 and squashes the following commits:
      
      fe296b4 [Yin Huai] Create new Path to take care special characters in the authority of a Path's URI.
      d2383fb5
    • [SPARK-6379][SQL] Support a function to call user-defined functions registered in SQLContext · 352a5da4
      Takeshi YAMAMURO authored
      This is useful for using pre-defined UDFs in SQLContext:
      
      val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
      val sqlctx = df.sqlContext
      sqlctx.udf.register("simpleUdf", (v: Int) => v * v)
      df.select($"id", sqlctx.callUdf("simpleUdf", $"value"))
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #5061 from maropu/SupportUDFConversionInSparkContext and squashes the following commits:
      
      f858aff [Takeshi YAMAMURO] Move the function into functions.scala
      afd0380 [Takeshi YAMAMURO] Add a return type of callUDF
      599b76c [Takeshi YAMAMURO] Remove the implicit conversion and add SqlContext#callUdf
      8b56f10 [Takeshi YAMAMURO] Support an implicit conversion from udf"name" to a UDF defined in SQLContext
      352a5da4
    • [SPARK-6179][SQL] Add token for "SHOW PRINCIPALS role_name" and "SHOW TRANSACTIONS" and "SHOW COMPACTIONS" · 48cc8400
      DoingDone9 authored
      [SPARK-6179][SQL] Add token for "SHOW PRINCIPALS role_name" and "SHOW TRANSACTIONS" and "SHOW COMPACTIONS"
      
      [SHOW PRINCIPALS role_name]
      Lists all roles and users who belong to this role.
      Only the admin role has privilege for this.
      
      [SHOW COMPACTIONS]
      It returns a list of all tables and partitions currently being compacted or scheduled for compaction when Hive transactions are being used.
      
      [SHOW TRANSACTIONS]
      It is for use by administrators when Hive transactions are being used. It returns a list of all currently open and aborted transactions in the system.
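
      A hedged usage sketch (the role name is made up, and the last two statements require Hive transactions to be enabled):

      ```scala
      // With the new tokens these statements parse in HiveContext and are
      // passed through to Hive instead of failing in Spark's parser.
      hiveContext.sql("SHOW PRINCIPALS admin_role").collect()
      hiveContext.sql("SHOW COMPACTIONS").collect()
      hiveContext.sql("SHOW TRANSACTIONS").collect()
      ```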
      
      Author: DoingDone9 <799203320@qq.com>
      Author: Zhongshuai Pei <799203320@qq.com>
      Author: Xu Tingjun <xutingjun@huawei.com>
      
      Closes #4902 from DoingDone9/SHOW_PRINCIPALS and squashes the following commits:
      
      4add42f [Zhongshuai Pei] for test
      311f806 [Zhongshuai Pei] for test
      0c7550a [DoingDone9] Update HiveQl.scala
      c8aeb1c [Xu Tingjun] aa
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      48cc8400
    • [SPARK-5068][SQL] Fix bug querying data when path doesn't exist for HiveContext · 1f39a611
      lazymam500 authored
      This PR follows up on PRs #3907, #3891 & #4356.
      According to marmbrus's and liancheng's comments, I try to use fs.globStatus to retrieve all FileStatus objects under the path(s), and then do the filtering locally.

      [1]. Get a pathPattern from each path, and put it into pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*)
      [2]. Retrieve all FileStatus objects, and cache them by updating existPathSet.
      [3]. Do the filtering locally.
      [4]. If we have a new pathPattern, do steps 1 and 2 again. (an external table may have more than one partition pathPattern)
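
      A hedged sketch of steps [1]-[3] (the Hadoop Configuration and the concrete paths are assumptions):

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileStatus, Path}

      val hadoopConf = new Configuration()
      val path = new Path("hdfs://cluster/user/demo/2016/08/12")
      val pattern = new Path("hdfs://cluster/user/demo/*/*/*") // step [1]
      val fs = path.getFileSystem(hadoopConf)
      val existPathSet =                                       // step [2]
        Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
          .map(_.getPath.toUri.getPath).toSet
      val pathExists = existPathSet.contains(path.toUri.getPath) // step [3]
      ```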
      
      chenghao-intel jeanlyn
      
      Author: lazymam500 <lazyman500@gmail.com>
      Author: lazyman <lazyman500@gmail.com>
      
      Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:
      
      5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf,fix scala style
      e1d6386 [lazymam500] fix scala style
      f23133f [lazymam500] bug fix
      47e0023 [lazymam500] fix scala style,add config flag,break the chaining
      04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exist #2
      41f60ce [lazymam500] Merge pull request #1 from apache/master
      1f39a611
    • [SPARK-6199] [SQL] Support CTE in HiveContext and SQLContext · 2f535887
      haiyang authored
      Author: haiyang <huhaiyang@huawei.com>
      
      Closes #4929 from haiyangsea/cte and squashes the following commits:
      
      220b67d [haiyang] add golden files for cte test
      d3c7681 [haiyang] Merge branch 'master' into cte-repair
      0ba2070 [haiyang] modify code style
      9ce6b58 [haiyang] fix conflict
      ff74741 [haiyang] add comment for With plan
      0d56af4 [haiyang] code indention
      776a440 [haiyang] add comments for resolve relation strategy
      2fccd7e [haiyang] add comments for resolve relation strategy
      241bbe2 [haiyang] fix cte problem of view
      e9e1237 [haiyang] fix test case problem
      614182f [haiyang] add test cases for CTE feature
      32e415b [haiyang] add comment
      1cc8c15 [haiyang] support with
      03f1097 [haiyang] support with
      e960099 [haiyang] support with
      9aaa874 [haiyang] support with
      0566978 [haiyang] support with
      a99ecd2 [haiyang] support with
      c3fa4c2 [haiyang] support with
      3b6077f [haiyang] support with
      5f8abe3 [haiyang] support with
      4572b05 [haiyang] support with
      f801f54 [haiyang] support with
      2f535887
    • [Minor][SQL] Fix typo in sql · 7dbd3716
      Guancheng (G.C.) Chen authored
      In this PR, "analyser" is changed to "analyzer" to keep a consistent naming. Some other typos are also fixed.
      
      Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
      
      Closes #5474 from gchen/sql-typo and squashes the following commits:
      
      70e6e76 [Guancheng (G.C.) Chen] Merge branch 'sql-typo' of github.com:gchen/spark into sql-typo
      fb7a6e2 [Guancheng (G.C.) Chen] fix typo in sql
      37e3da1 [Guancheng (G.C.) Chen] fix typo in sql
      7dbd3716
    • [SPARK-6863] Fix formatting on SQL programming guide. · 6437e7cc
      Santiago M. Mola authored
      https://issues.apache.org/jira/browse/SPARK-6863
      
      Author: Santiago M. Mola <santiago.mola@sap.com>
      
      Closes #5472 from smola/fix/sql-docs and squashes the following commits:
      
      42503d4 [Santiago M. Mola] [SPARK-6863] Fix formatting on SQL programming guide.
      6437e7cc
    • [SPARK-6611][SQL] Add support for INTEGER as synonym of INT. · 5f7b7cda
      Santiago M. Mola authored
      https://issues.apache.org/jira/browse/SPARK-6611
      
      Author: Santiago M. Mola <santiago.mola@sap.com>
      
      Closes #5271 from smola/features/integer-parse and squashes the following commits:
      
      f5c1c64 [Santiago M. Mola] [SPARK-6611] Add support for INTEGER as synonym of INT.
      5f7b7cda
    • [SPARK-6858][SQL] Register Java HashMap for SparkSqlSerializer · 198cf2a3
      Liang-Chi Hsieh authored
      Since the Kryo serializer is now used for `GeneralHashedRelation` whether Kryo is enabled or not, it is better to register Java `HashMap` in `SparkSqlSerializer`.
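
      The registration itself is one line; a minimal self-contained sketch:

      ```scala
      import com.esotericsoftware.kryo.Kryo

      val kryo = new Kryo()
      // Register java.util.HashMap (used by GeneralHashedRelation) so Kryo can
      // write a compact class id instead of the full class name.
      kryo.register(classOf[java.util.HashMap[_, _]])
      ```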
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5465 from viirya/register_hashmap and squashes the following commits:
      
      9062601 [Liang-Chi Hsieh] Register Java HashMap for SparkSqlSerializer.
      198cf2a3
    • [SPARK-6835] [SQL] Fix bug of Hive UDTF in Lateral View (ClassNotFound) · 3ceb810a
      Cheng Hao authored
      ```SQL
      select key, v from src lateral view stack(3, 1+1, 2+2, 3) d as v;
      ```
      Will cause exception
      ```
      java.lang.ClassNotFoundException: stack
      at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:148)
      at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274)
      at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274)
      at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:280)
      at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:280)
      at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:285)
      at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:285)
      at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:291)
      at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
      at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
      at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60)
      at scala.Option.map(Option.scala:145)
      at org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:60)
      at org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:70)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117)
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5444 from chenghao-intel/hive_udtf and squashes the following commits:
      
      065a98c [Cheng Hao] fix bug of Hive UDTF in Lateral View (ClassNotFound)
      3ceb810a
    • [hotfix] [build] Make sure JAVA_HOME is set for tests. · 694aef0d
      Marcelo Vanzin authored
      This is needed at least for YARN integration tests, since `$JAVA_HOME` is used to launch the executors.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5441 from vanzin/yarn-test-test and squashes the following commits:
      
      3eeec30 [Marcelo Vanzin] Use JAVA_HOME when available, java.home otherwise.
      d71f1bb [Marcelo Vanzin] And sbt too.
      6bda399 [Marcelo Vanzin] WIP: Testing to see whether this fixes the yarn test issue.
      694aef0d
    • [Minor][Core] Fix typo · 95a07591
      Liang-Chi Hsieh authored
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5466 from viirya/fix_ShuffleMapTask_typo and squashes the following commits:
      
      2789fd5 [Liang-Chi Hsieh] fix typo.
      95a07591
  4. Apr 10, 2015
    • [SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection · 67d06880
      Volodymyr Lyubinets authored
      [SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
      
      cc marmbrus
      
      Author: Volodymyr Lyubinets <vlyubin@gmail.com>
      
      Closes #5279 from vlyubin/speedup and squashes the following commits:
      
      e75a387 [Volodymyr Lyubinets] Changes to ScalaUDF
      11a20ec [Volodymyr Lyubinets] Avoid creating a tuple
      c327bc9 [Volodymyr Lyubinets] Moved the only remaining function from DataTypeConversions to DateUtils
      dec6802 [Volodymyr Lyubinets] Addressed review feedback
      74301fa [Volodymyr Lyubinets] Addressed review comments
      afa3aa5 [Volodymyr Lyubinets] Minor refactoring, added license, removed debug output
      881dc60 [Volodymyr Lyubinets] Moved to a separate module; addressed review comments; one extra place of usage; changed behaviour for Java
      8cad6e2 [Volodymyr Lyubinets] Addressed review comments
      41b2aa9 [Volodymyr Lyubinets] Creating converters for ScalaReflection stuff, and more
      67d06880
    • [SPARK-6851][SQL] Create new instance for each converted parquet relation · 23d5f886
      Michael Armbrust authored
      Otherwise we end up rewriting predicates to be trivially equal (i.e. `a#1 = a#2` -> `a#3 = a#3`), at which point the query is no longer valid.
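
      A hedged repro sketch of the failure mode (the table name and join key are made up):

      ```scala
      import sqlContext.implicits._ // sqlContext: an existing SQLContext

      val t = sqlContext.table("parquet_backed_hive_table")
      // Before the fix, both sides resolved to the same converted relation
      // instance, so the join condition collapsed into the trivial a#3 = a#3.
      val joined = t.as("l").join(t.as("r"), $"l.key" === $"r.key")
      ```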
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5458 from marmbrus/selfJoinParquet and squashes the following commits:
      
      22df77c [Michael Armbrust] [SPARK-6851][SQL] Create new instance for each converted parquet relation
      23d5f886
    • [SPARK-6850] [SparkR] use one partition when we need to compare the whole result · 68ecdb7f
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5460 from davies/r_test and squashes the following commits:
      
      0a593ce [Davies Liu] use one partition when we need to compare the whole result
      68ecdb7f
    • [SPARK-6216] [PySpark] check the python version in worker · 4740d6a1
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5404 from davies/check_version and squashes the following commits:
      
      e559248 [Davies Liu] add tests
      ec33b5f [Davies Liu] check the python version in worker
      4740d6a1