  1. Jan 26, 2016
  2. Jan 25, 2016
    • tedyu's avatar
      [SPARK-12934] use try-with-resources for streams · fdcc3512
      tedyu authored
      liancheng please take a look
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #10906 from tedyu/master.
      fdcc3512
    • Wenchen Fan's avatar
      [SPARK-12936][SQL] Initial bloom filter implementation · 109061f7
      Wenchen Fan authored
      This PR adds an initial implementation of a bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).
      
      Some differences from the design doc:
      
      * expose `bitSize` instead of `sizeInBytes` to the user.
      * always require the `expectedInsertions` parameter when creating a bloom filter.
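
To make the two parameters above concrete, here is a minimal Python sketch of a bloom filter sized from an expected number of insertions. It is not the Spark or Guava implementation: the class name `SimpleBloomFilter`, the `fpp` (false positive probability) argument, and the SHA-256 based double hashing are assumptions chosen for illustration. Only the textbook sizing formulas are used.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy bloom filter for illustration; not the Spark or Guava code."""

    def __init__(self, expected_insertions, fpp=0.03):
        if expected_insertions <= 0:
            raise ValueError("expected_insertions must be positive")
        if not 0.0 < fpp < 1.0:
            raise ValueError("fpp must be in (0, 1)")
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.bit_size = max(
            1, math.ceil(-expected_insertions * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.bit_size / expected_insertions * math.log(2))
        )
        self._bits = bytearray((self.bit_size + 7) // 8)

    def _indexes(self, item):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.bit_size

    def put(self, item):
        for idx in self._indexes(item):
            self._bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(
            self._bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item)
        )
```

Exposing `bit_size` rather than a byte count mirrors the first bullet; requiring `expected_insertions` up front mirrors the second, since the number of hash functions cannot be chosen sensibly without it.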
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10883 from cloud-fan/bloom-filter.
      109061f7
    • Wenchen Fan's avatar
      [SPARK-12879] [SQL] improve the unsafe row writing framework · be375fcb
      Wenchen Fan authored
      As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should document it better and make it easier to use.
      
      This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, instead of always pointing the row to the buffer at the end, we only need to update the size of the row. If all fields are of primitive type, we can even skip updating the row size. We can then apply this technique to more places easily.
      
      A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
      **old version**
      ```
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
      -------------------------------------------------------------------------------
      single long                             2616.04           102.61         1.00 X
      single nullable long                    3032.54            88.52         0.86 X
      primitive types                         9121.05            29.43         0.29 X
      nullable primitive types               12410.60            21.63         0.21 X
      ```
      
      **new version**
      ```
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
      -------------------------------------------------------------------------------
      single long                             1533.34           175.07         1.00 X
      single nullable long                    2306.73           116.37         0.66 X
      primitive types                         8403.93            31.94         0.18 X
      nullable primitive types               12448.39            21.56         0.12 X
      ```
      
      For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it is nullable, we still get a 1.3x speedup. For the other cases the boost is small, as the saved operations account for only a small proportion of the whole process. The benchmark code is included in this PR.
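
The quoted speedups can be reproduced from the average times in the two tables above. The snippet below is only a worked calculation; the dictionary names are made up for this example and the numbers are copied from the benchmark output.

```python
# Average times in ms, copied from the "old version" and "new version" tables.
old_ms = {
    "single long": 2616.04,
    "single nullable long": 3032.54,
    "primitive types": 9121.05,
    "nullable primitive types": 12410.60,
}
new_ms = {
    "single long": 1533.34,
    "single nullable long": 2306.73,
    "primitive types": 8403.93,
    "nullable primitive types": 12448.39,
}

# Speedup = old time / new time; values above 1 mean the new version is faster.
speedup = {name: old_ms[name] / new_ms[name] for name in old_ms}

for name, ratio in speedup.items():
    print("%-26s %.2fx" % (name, ratio))
```

The first two ratios come out at roughly 1.7 and 1.3, matching the text; the two multi-column cases stay close to 1.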
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10809 from cloud-fan/unsafe-projection.
      be375fcb
    • Cheng Lian's avatar
      [SPARK-12934][SQL] Count-min sketch serialization · 6f0f1d9e
      Cheng Lian authored
      This PR adds serialization support for `CountMinSketch`.
      
      A version number is added to version the serialized binary format.
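
As an illustration of why a version number helps, here is a small Python sketch of a versioned binary layout for a counter table. The layout (version, depth, width, then big-endian 64-bit counters) and the names `FORMAT_VERSION`, `serialize_counts` and `deserialize_counts` are assumptions for this example and are not the format used by `CountMinSketch`.

```python
import struct

FORMAT_VERSION = 1  # hypothetical version tag for this example layout
HEADER = ">iii"     # version, depth, width as big-endian 32-bit ints


def serialize_counts(depth, width, table):
    """Pack a depth x width counter table behind a version header."""
    out = struct.pack(HEADER, FORMAT_VERSION, depth, width)
    for row in table:
        out += struct.pack(">%dq" % width, *row)
    return out


def deserialize_counts(data):
    """Read the header first and refuse layouts this reader does not know."""
    version, depth, width = struct.unpack_from(HEADER, data, 0)
    if version != FORMAT_VERSION:
        raise ValueError("unsupported format version: %d" % version)
    offset = struct.calcsize(HEADER)
    table = []
    for _ in range(depth):
        table.append(list(struct.unpack_from(">%dq" % width, data, offset)))
        offset += 8 * width
    return depth, width, table
```

Because the version is the first field read, a later change to the layout can be detected before any counters are misinterpreted.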
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10893 from liancheng/cms-serialization.
      6f0f1d9e
    • Yanbo Liang's avatar
      [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark · dcae355c
      Yanbo Liang authored
      `PCAModel` can output `explainedVariance` on the Python side.
      
      cc mengxr srowen
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10830 from yanboliang/spark-12905.
      dcae355c
    • gatorsmile's avatar
      [SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns · 9348431d
      gatorsmile authored
      When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of partitioning columns. For example,
      ```
              df.write
                .format(source)
                .partitionBy("i")
                .bucketBy(8, "i", "k")
                .saveAsTable("bucketed_table")
      ```
      However, in the above case, adding column `i` to `bucketBy` is useless; it just wastes CPU when reading or writing bucketed tables. Thus, like Hive, we can throw an exception and let users make the change.
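
The validation described here amounts to an overlap check between the two column lists. The following Python sketch shows the idea only; the function name `check_bucket_columns` and the error wording are hypothetical and do not come from the Spark code.

```python
def check_bucket_columns(partition_cols, bucket_cols):
    """Reject bucketing columns that are already partitioning columns."""
    partition_set = set(partition_cols)
    # Keep the original order so the error message is easy to read.
    overlap = [c for c in bucket_cols if c in partition_set]
    if overlap:
        raise ValueError(
            "bucketBy columns %s should not be part of partitionBy columns %s"
            % (overlap, list(partition_cols))
        )
```

With the example above, `check_bucket_columns(["i"], ["i", "k"])` raises, while `check_bucket_columns(["i"], ["k"])` passes silently.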
      
      Also added a test case checking that the `sortBy` and `bucketBy` column information is correctly saved in the metastore table.
      
      Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10891 from gatorsmile/commonKeysInPartitionByBucketBy.
      9348431d
    • Yin Huai's avatar
      00026fa9
    • Davies Liu's avatar
      [SPARK-12902] [SQL] visualization for generated operators · 7d877c34
      Davies Liu authored
      This PR brings back visualization for generated operators. They look like this:
      
      ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)
      
      ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)
      
      Note: SQL metrics are not supported right now because they are very slow; they will be supported once we have batch mode.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10828 from davies/viz_codegen.
      7d877c34
    • Alex Bozarth's avatar
      [SPARK-12149][WEB UI] Executor UI improvement suggestions - Color UI · c037d254
      Alex Bozarth authored
      Added color coding to the Executors page for Active Tasks, Failed Tasks, Completed Tasks and Task Time.
      
      Active Tasks is shaded blue, with its range based on the percentage of total cores used.
      Failed Tasks is shaded red, ranging over the first 10% of total tasks failed.
      Completed Tasks is shaded green, ranging over 10% of total tasks including failed and active tasks, but only when there are active or failed tasks on that executor.
      Task Time is shaded red when GC Time goes over 10% of total time, with its range directly corresponding to the percentage of total time.
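
One possible reading of the Failed Tasks and Task Time rules, written as Python for clarity. This is an interpretation of the description above, not the UI code: the function names, the 0 to 1 shade scale, and the linear ramp are all assumptions.

```python
def failed_tasks_shade(failed, total, cap=0.10):
    """Red intensity in [0, 1]: ramps up until `cap` of all tasks have failed."""
    if total <= 0 or failed <= 0:
        return 0.0
    return min((failed / total) / cap, 1.0)


def task_time_shade(gc_time, total_time, threshold=0.10):
    """Red intensity equal to the GC share of time, once it exceeds `threshold`."""
    if total_time <= 0:
        return 0.0
    fraction = gc_time / total_time
    return fraction if fraction > threshold else 0.0
```

Under this reading, an executor with 5 failed tasks out of 100 gets a half-strength red, and any executor with 10% or more failed tasks gets the full shade.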
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #10154 from ajbozarth/spark12149.
      c037d254
    • Xiangrui Meng's avatar
      Closes #10879 · ef8fb361
      Xiangrui Meng authored
      Closes #9046
      Closes #8532
      Closes #10756
      Closes #8960
      Closes #10485
      Closes #10467
      ef8fb361
    • Yanbo Liang's avatar
      [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions · dd2325d9
      Yanbo Liang authored
      Update the user guide for RFormula feature interactions. We also document other new features in Spark 1.6, such as support for string labels.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10222 from yanboliang/spark-11965.
      dd2325d9
    • Michael Allman's avatar
      [SPARK-12755][CORE] Stop the event logger before the DAG scheduler · 4ee8191e
      Michael Allman authored
      [SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.
      
      This contribution is my original work, and I license this work to the Spark project under the project's open source license.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #10700 from mallman/stop_event_logger_first.
      4ee8191e
    • Andy Grove's avatar
      [SPARK-12932][JAVA API] improved error message for java type inference failure · d8e48052
      Andy Grove authored
      Author: Andy Grove <andygrove73@gmail.com>
      
      Closes #10865 from andygrove/SPARK-12932.
      d8e48052
    • hyukjinkwon's avatar
      [SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format). · 3adebfc9
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-12901
      This PR refactors the options in JSON and CSV datasources.
      
      In more detail:
      
      1. `JSONOptions` uses the same format as `CSVOptions`.
      2. Neither is a case class.
      3. `CSVRelation` no longer has to be serializable (it was declared `with Serializable`, which I removed).
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #10895 from HyukjinKwon/SPARK-12901.
      3adebfc9
  3. Jan 24, 2016
    • Cheng Lian's avatar
      [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows · 3327fd28
      Cheng Lian authored
      When the actual row length doesn't conform to the specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayIndexOutOfBoundsException`.
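
The kind of check described here can be sketched in a few lines of Python. The helper name `to_row`, the dict return type, and the message wording are hypothetical; the point is only that the length is compared against the schema before any field is accessed by index.

```python
def to_row(field_names, values):
    """Build a row mapping, failing early with a descriptive message."""
    if len(values) != len(field_names):
        raise ValueError(
            "Input row has %d values but the schema declares %d fields: %s"
            % (len(values), len(field_names), ", ".join(field_names))
        )
    return dict(zip(field_names, values))
```

A mismatch then reports both lengths and the expected field names instead of an index error from deep inside the conversion.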
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10886 from liancheng/spark-12624.
      3327fd28
    • Jeff Zhang's avatar
      [SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark · e789b1d2
      Jeff Zhang authored
      
      davies, mind reviewing?
      
      This is the error message after this PR:
      
      ```
      15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
      /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
        warnings.warn("You must build Spark with Hive. "
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
          return DataFrameReader(self)
        File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
          self._jreader = sqlContext._ssql_ctx.read()
        File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
          raise e
      py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
      : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
      	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
      	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
      	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
      	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
      	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
      	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
      	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
      	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
      	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
      	at py4j.Gateway.invoke(Gateway.java:214)
      	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
      	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
      	at py4j.GatewayConnection.run(GatewayConnection.java:209)
      	at java.lang.Thread.run(Thread.java:745)
      ```
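
The traceback above shows a warning with a build hint followed by the original exception being re-raised. A generic Python sketch of that warn-then-re-raise pattern is given below; `init_with_hint` and its parameters are hypothetical names, and this is not the code in `pyspark/sql/context.py`.

```python
import warnings


def init_with_hint(factory, hint):
    """Call `factory`; on failure, emit `hint` as a warning and re-raise."""
    try:
        return factory()
    except Exception:
        # The hint tells the user what to try; the original error and its
        # traceback are preserved by the bare `raise`.
        warnings.warn(hint)
        raise
```

The caller still sees the underlying failure (here, the connection error), but the warning printed just before it points at the most likely fix.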
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10126 from zjffdu/SPARK-12120.
      e789b1d2
    • Holden Karau's avatar
      [SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python tools · a8340013
      Holden Karau authored
      Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning).
      
      cc JoshRosen who looked at the original JIRA.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
      a8340013
    • Josh Rosen's avatar
      [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build · f4004601
      Josh Rosen authored
      ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive).
      
      This patch attempts to improve the isolation of these tests in order to address this issue.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
      f4004601
  4. Jan 23, 2016