  1. Jan 24, 2016
    • [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows · 3327fd28
      Cheng Lian authored
      When the actual row length doesn't match the number of fields specified by the schema, we should give a better error message instead of throwing an unintuitive `ArrayIndexOutOfBoundsException`.
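
      As a rough PySpark sketch of the failure mode (the schema and data below are made up for illustration, and the exact code path that raised the old exception is not shown), a row that is shorter than its declared schema should now fail with a descriptive message:

      ```
      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.sql.types import StructType, StructField, IntegerType, StringType

      sc = SparkContext("local[1]", "row-length-check")
      sqlContext = SQLContext(sc)

      # The schema declares two fields, but each row carries only one value.
      schema = StructType([
          StructField("id", IntegerType()),
          StructField("name", StringType()),
      ])
      rows = sc.parallelize([(1,), (2,)])

      # With this change, the mismatch is expected to surface as a clear error
      # message rather than an opaque ArrayIndexOutOfBoundsException.
      sqlContext.createDataFrame(rows, schema).collect()
      ```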
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10886 from liancheng/spark-12624.
      3327fd28
    • [SPARK-12120][PYSPARK] Improve exception message when failing to init… · e789b1d2
      Jeff Zhang authored
      [SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark
      
      davies, mind reviewing?

      This is the error message after this PR:
      
      ```
      15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
      /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
        warnings.warn("You must build Spark with Hive. "
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
          return DataFrameReader(self)
        File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
          self._jreader = sqlContext._ssql_ctx.read()
        File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
          raise e
      py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
      : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
      	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
      	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
      	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
      	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
      	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
      	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
      	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
      	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
      	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
      	at py4j.Gateway.invoke(Gateway.java:214)
      	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
      	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
      	at py4j.GatewayConnection.run(GatewayConnection.java:209)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10126 from zjffdu/SPARK-12120.
      e789b1d2
    • [SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python tools · a8340013
      Holden Karau authored
      Minor, since so few people use them, but it would be good to have a requirements file for our Python release tools for easier setup (and version pinning).
      
      cc JoshRosen who looked at the original JIRA.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
      a8340013
    • [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build · f4004601
      Josh Rosen authored
      ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive).
      
      This patch attempts to improve the isolation of these tests in order to address this issue.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
      f4004601
  2. Jan 20, 2016
    • [SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR. · 1b2a918e
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10201 from sun-rui/SPARK-12204.
      1b2a918e
    • [SPARK-12910] Fixes: R version for installing sparkR · d7415991
      smishra8 authored
      Testing code:
      ```
      $ ./install-dev.sh
      USING R_HOME = /usr/bin
      ERROR: this R is version 2.15.1, package 'SparkR' requires R >= 3.0
      ```
      
      Using the new argument:
      ```
      $ ./install-dev.sh /content/username/SOFTWARE/R-3.2.3
      USING R_HOME = /content/username/SOFTWARE/R-3.2.3/bin
      * installing *source* package ‘SparkR’ ...
      ** R
      ** inst
      ** preparing package for lazy loading
      Creating a new generic function for ‘colnames’ in package ‘SparkR’
      Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
      Creating a new generic function for ‘cov’ in package ‘SparkR’
      Creating a new generic function for ‘na.omit’ in package ‘SparkR’
      Creating a new generic function for ‘filter’ in package ‘SparkR’
      Creating a new generic function for ‘intersect’ in package ‘SparkR’
      Creating a new generic function for ‘sample’ in package ‘SparkR’
      Creating a new generic function for ‘transform’ in package ‘SparkR’
      Creating a new generic function for ‘subset’ in package ‘SparkR’
      Creating a new generic function for ‘summary’ in package ‘SparkR’
      Creating a new generic function for ‘lag’ in package ‘SparkR’
      Creating a new generic function for ‘rank’ in package ‘SparkR’
      Creating a new generic function for ‘sd’ in package ‘SparkR’
      Creating a new generic function for ‘var’ in package ‘SparkR’
      Creating a new generic function for ‘predict’ in package ‘SparkR’
      Creating a new generic function for ‘rbind’ in package ‘SparkR’
      Creating a generic function for ‘lapply’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘Filter’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘alias’ from package ‘stats’ in package ‘SparkR’
      Creating a generic function for ‘substr’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘%in%’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘mean’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘unique’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘nrow’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘ncol’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘head’ from package ‘utils’ in package ‘SparkR’
      Creating a generic function for ‘factorial’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘atan2’ from package ‘base’ in package ‘SparkR’
      Creating a generic function for ‘ifelse’ from package ‘base’ in package ‘SparkR’
      ** help
      No man pages found in package  ‘SparkR’
      *** installing help indices
      ** building package indices
      ** testing if installed package can be loaded
      * DONE (SparkR)
      
      ```
      
      Author: Shubhanshu Mishra <smishra8@illinois.edu>
      
      Closes #10836 from napsternxg/master.
      d7415991
    • Yin Huai
      d60f8d74
    • [SPARK-8968][SQL] external sort by the partition columns when dynamic... · 015c8efb
      wangfei authored
      [SPARK-8968][SQL] external sort by the partition columns when dynamic partitioning to optimize the memory overhead

      Currently the hash-based writer for dynamic partitioning shows poor performance on big data and causes many small files and heavy GC. With this patch we do an external sort by the partition columns first, so that at any time only one writer needs to be open.
      
      before this patch:
      ![gc](https://cloud.githubusercontent.com/assets/7018048/9149788/edc48c6e-3dec-11e5-828c-9995b56e4d65.PNG)
      
      after this patch:
      ![gc-optimize-externalsort](https://cloud.githubusercontent.com/assets/7018048/9149794/60f80c9c-3ded-11e5-8a56-7ae18ddc7a2f.png)
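
      To make the affected path concrete, here is a hedged PySpark sketch of the kind of Hive dynamic-partition insert this patch targets (table and column names are hypothetical, and a working Hive setup is assumed):

      ```
      from pyspark import SparkContext
      from pyspark.sql import HiveContext

      sc = SparkContext("local[1]", "dynamic-partition-insert")
      sqlContext = HiveContext(sc)

      # Hypothetical tables: 'events_src' holds raw rows, 'events' is partitioned by dt.
      sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
      sqlContext.sql("""
          INSERT OVERWRITE TABLE events PARTITION (dt)
          SELECT user_id, action, dt FROM events_src
      """)
      # Before this patch the hash-based writer kept one open file per partition
      # value, producing many small files and heavy GC; with the external sort by
      # the partition columns, rows for each partition arrive together and only
      # one writer is open at a time.
      ```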
      
      Author: wangfei <wangfei_hello@126.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #7336 from scwf/dynamic-optimize-basedon-apachespark.
      015c8efb
    • [SPARK-12797] [SQL] Generated TungstenAggregate (without grouping keys) · b362239d
      Davies Liu authored
      As discussed in #10786, the generated TungstenAggregate does not support imperative functions.
      
      For a query
      ```
      sqlContext.range(10).filter("id > 1").groupBy().count()
      ```
      
      The generated code looks like:
      ```
      /* 032 */     if (!initAgg0) {
      /* 033 */       initAgg0 = true;
      /* 034 */
      /* 035 */       // initialize aggregation buffer
      /* 037 */       long bufValue2 = 0L;
      /* 038 */
      /* 039 */
      /* 040 */       // initialize Range
      /* 041 */       if (!range_initRange5) {
      /* 042 */         range_initRange5 = true;
             ...
      /* 071 */       }
      /* 072 */
      /* 073 */       while (!range_overflow8 && range_number7 < range_partitionEnd6) {
      /* 074 */         long range_value9 = range_number7;
      /* 075 */         range_number7 += 1L;
      /* 076 */         if (range_number7 < range_value9 ^ 1L < 0) {
      /* 077 */           range_overflow8 = true;
      /* 078 */         }
      /* 079 */
      /* 085 */         boolean primitive11 = false;
      /* 086 */         primitive11 = range_value9 > 1L;
      /* 087 */         if (!false && primitive11) {
      /* 092 */           // do aggregate and update aggregation buffer
      /* 099 */           long primitive17 = -1L;
      /* 100 */           primitive17 = bufValue2 + 1L;
      /* 101 */           bufValue2 = primitive17;
      /* 105 */         }
      /* 107 */       }
      /* 109 */
      /* 110 */       // output the result
      /* 112 */       bufferHolder25.reset();
      /* 114 */       rowWriter26.initialize(bufferHolder25, 1);
      /* 118 */       rowWriter26.write(0, bufValue2);
      /* 120 */       result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize());
      /* 121 */       currentRow = result24;
      /* 122 */       return;
      /* 124 */     }
      /* 125 */
      ```
      
      cc nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10840 from davies/gen_agg.
      b362239d
    • [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal · 10173279
      Herman van Hovell authored
      The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. This PR changes that behavior: a decimal literal is now converted into an exact ```BigDecimal```.
      
      The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double.
      
      Because ```BigDecimal``` is now the default, this PR replaces the explicit ```BigDecimal``` literal with a ```Double``` literal: you can get a double literal by appending a 'D' to the value, for instance: ```3.141527D```
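
      A small PySpark sketch of the resulting behaviour (the column aliases are mine, and the comments reflect my reading of the change rather than captured output):

      ```
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext("local[1]", "decimal-literals")
      sqlContext = SQLContext(sc)

      df = sqlContext.sql("SELECT 12.1 AS plain, 12.1e01 AS scientific, 3.141527D AS suffixed")
      df.printSchema()
      # plain      -> an exact DecimalType (previously DoubleType)
      # scientific -> DoubleType, unchanged
      # suffixed   -> DoubleType, via the 'D' suffix
      ```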
      
      cc davies rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10796 from hvanhovell/SPARK-12848.
      10173279
    • [SPARK-12888][SQL] benchmark the new hash expression · f3934a8d
      Wenchen Fan authored
      Benchmarked on 4 different schemas; the results:
      ```
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      Hash For simple:                   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
      -------------------------------------------------------------------------------
      interpreted version                       31.47           266.54         1.00 X
      codegen version                           64.52           130.01         0.49 X
      ```
      
      ```
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      Hash For normal:                   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
      -------------------------------------------------------------------------------
      interpreted version                     4068.11             0.26         1.00 X
      codegen version                         1175.92             0.89         3.46 X
      ```
      
      ```
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      Hash For array:                    Avg Time(ms)    Avg Rate(M/s)  Relative Rate
      -------------------------------------------------------------------------------
      interpreted version                     9276.70             0.06         1.00 X
      codegen version                        14762.23             0.04         0.63 X
      ```
      
      ```
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      Hash For map:                      Avg Time(ms)    Avg Rate(M/s)  Relative Rate
      -------------------------------------------------------------------------------
      interpreted version                    58869.79             0.01         1.00 X
      codegen version                         9285.36             0.06         6.34 X
      ```
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10816 from cloud-fan/hash-benchmark.
      f3934a8d
    • [SPARK-12616][SQL] Making Logical Operator `Union` Support Arbitrary Number of Children · 8f90c151
      gatorsmile authored
      The existing `Union` logical operator only supports two children. Thus, this patch adds a new logical operator `Unions`, which can have an arbitrary number of children, to replace the existing one.

      The `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions`. Note that this problem doesn't exist in the physical plan, because the physical `Unions` already supports an arbitrary number of children.
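
      For illustration, a hedged PySpark sketch of the usage pattern this targets: unioning many inputs in a reduce, which previously produced a deeply nested chain of binary Union nodes (the toy data below stands in for hundreds of thousands of files):

      ```
      from functools import reduce
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, DataFrame

      sc = SparkContext("local[1]", "many-unions")
      sqlContext = SQLContext(sc)

      # A few hundred tiny DataFrames standing in for many input files.
      parts = [sqlContext.createDataFrame([(i, "part-%d" % i)], ["id", "name"])
               for i in range(200)]

      # unionAll is binary, so this builds a deep tree of Union operators; with
      # this change the optimizer collapses adjacent Unions into one multi-child
      # Union, keeping analysis and optimization time manageable.
      combined = reduce(DataFrame.unionAll, parts)
      print(combined.count())
      ```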
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #10577 from gatorsmile/unionAllMultiChildren.
      8f90c151
    • [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project · b7d74a60
      Shixiong Zhu authored
      Include the following changes:
      
      1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
      2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
      3. Update the ActorWordCount example and add the JavaActorWordCount example
      4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10744 from zsxwing/streaming-akka-2.
      b7d74a60
    • [SPARK-12847][CORE][STREAMING] Remove StreamingListenerBus and post all... · 944fdadf
      Shixiong Zhu authored
      [SPARK-12847][CORE][STREAMING] Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
      
      Including the following changes:
      
      1. Add StreamingListenerForwardingBus, which unwraps WrappedStreamingListenerEvent events in `onOtherEvent` and forwards them to the StreamingListener
      2. Remove StreamingListenerBus
      3. Merge AsynchronousListenerBus and LiveListenerBus into the same class, LiveListenerBus
      4. Add `logEvent` method to SparkListenerEvent so that EventLoggingListener can use it to ignore WrappedStreamingListenerEvents
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10779 from zsxwing/streaming-listener.
      944fdadf
    • [SPARK-10263][ML] Add @Since annotation to ml.param and ml.* · e3727c40
      Takahashi Hiroshi authored
      Add Since annotations to ml.param and ml.*
      
      Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
      Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp>
      
      Closes #8935 from taishi-oss/issue10263.
      e3727c40
    • [SPARK-12898] Consider having dummyCallSite for HiveTableScan · ab4a6bfd
      Rajesh Balamohan authored
      Currently, HiveTableScan runs with getCallSite, which is really expensive and shows up prominently when scanning large partitioned tables (e.g. TPC-DS), slowing down the overall runtime of the job. It would be good to consider using dummyCallSite in HiveTableScan.
      
      Author: Rajesh Balamohan <rbalamohan@apache.org>
      
      Closes #10825 from rajeshbalamohan/SPARK-12898.
      ab4a6bfd
    • [SPARK-12925][SQL] Improve HiveInspectors.unwrap for StringObjectIns… · e75e340a
      Rajesh Balamohan authored
      Text is already UTF-8 encoded, so converting it via "UTF8String.fromString" incurs decoding and re-encoding, which turns out to be expensive and redundant. Profiler snapshot details are attached in the JIRA (ref: https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png)
      
      Author: Rajesh Balamohan <rbalamohan@apache.org>
      
      Closes #10848 from rajeshbalamohan/SPARK-12925.
      e75e340a
    • [SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero... · 9753835c
      Imran Younus authored
      [SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero properly if the standard deviation of the target variable is zero.

      This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train.
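
      As a rough PySpark illustration of the affected scenario (this exercises the Scala solver only indirectly through `LinearRegression`, and the data and expectations below are mine): a target column with zero variance should now fit cleanly when fitIntercept is true.

      ```
      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.mllib.linalg import Vectors
      from pyspark.ml.regression import LinearRegression

      sc = SparkContext("local[1]", "constant-label-fit")
      sqlContext = SQLContext(sc)

      # Every label is identical, so the standard deviation of the target is zero.
      data = sqlContext.createDataFrame([
          (5.0, Vectors.dense(1.0, 2.0)),
          (5.0, Vectors.dense(2.0, 1.0)),
          (5.0, Vectors.dense(3.0, 5.0)),
      ], ["label", "features"])

      # With fitIntercept=True there is nothing to learn beyond the intercept, so
      # the fit should return intercept ~5.0 with zero coefficients instead of
      # hitting a division by zero inside WeightedLeastSquares.
      model = LinearRegression(fitIntercept=True).fit(data)
      print(model.intercept, model.coefficients)
      ```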
      
      Author: Imran Younus <iyounus@us.ibm.com>
      
      Closes #10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
      9753835c
    • [SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests · 9bb35c5b
      Gábor Lipták authored
      This is #9263 from gliptak (improving grouping/display of test case results) with a small fix to the bisecting k-means unit test.
      
      Author: Gábor Lipták <gliptak@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #10850 from mengxr/SPARK-11295.
      9bb35c5b
    • [SPARK-6519][ML] Add spark.ml API for bisecting k-means · 9376ae72
      Yu ISHIKAWA authored
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #9604 from yu-iskw/SPARK-6519.
      9376ae72