  1. Nov 03, 2014
    • Cheng Lian's avatar
      [SPARK-4202][SQL] Simple DSL support for Scala UDF · c238fb42
      Cheng Lian authored
      This feature is based on an offline discussion with mengxr and will hopefully be useful for the new MLlib pipeline API.
      
      For the following test snippet
      
      ```scala
      case class KeyValue(key: Int, value: String)
      val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD
      val foo = (a: Int, b: String) => a.toString + b
      ```
      
      the newly introduced DSL enables the following syntax
      
      ```scala
      import org.apache.spark.sql.catalyst.dsl._
      testData.select(Star(None), foo.call('key, 'value) as 'result)
      ```
      
      which is equivalent to
      
      ```scala
      testData.registerTempTable("testData")
      sqlContext.registerFunction("foo", foo)
      sql("SELECT *, foo(key, value) AS result FROM testData")
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3067 from liancheng/udf-dsl and squashes the following commits:
      
      f132818 [Cheng Lian] Adds DSL support for Scala UDF
      c238fb42
    • Davies Liu's avatar
      [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema of an RDD whose first row contains empty values (None, [], {}). It inspects the first 100 rows and merges their types into one schema, also merging the fields of StructType values. If the schema still contains a NullType, it shows a warning telling the user to try sampling.
      
      If a sampling ratio is given, the schema is inferred from all rows in the sample.
      
      Also adds samplingRatio to jsonFile() and jsonRDD().
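The merging step described above can be sketched in plain Python. This is a minimal illustration, not the PySpark implementation: the names `infer_type`, `merge_types`, `has_null` and `infer_schema` are hypothetical, types are represented as strings and tuples rather than Spark SQL type objects, and every row is assumed to be a dict.

```python
import warnings

NULL = "null"  # stand-in for NullType: "no information yet"


def merge_types(a, b):
    """Merge two inferred types; a null placeholder yields to any concrete type."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]:
        if a[0] == "array":
            return ("array", merge_types(a[1], b[1]))
        if a[0] == "struct":
            # Merge struct fields one by one, keeping fields seen on either side.
            fields = dict(a[1])
            for name, typ in b[1].items():
                fields[name] = merge_types(fields.get(name, NULL), typ)
            return ("struct", fields)
    raise TypeError("cannot merge %r with %r" % (a, b))


def infer_type(value):
    """Infer a type from one Python value; empty containers stay partly null."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (list, tuple)):
        element = NULL
        for item in value:
            element = merge_types(element, infer_type(item))
        return ("array", element)
    if isinstance(value, dict):
        return ("struct", {k: infer_type(v) for k, v in value.items()})
    raise TypeError("unsupported value: %r" % (value,))


def has_null(typ):
    """Return True if any part of the type is still the null placeholder."""
    if typ == NULL:
        return True
    if isinstance(typ, tuple):
        if typ[0] == "array":
            return has_null(typ[1])
        return any(has_null(t) for t in typ[1].values())
    return False


def infer_schema(rows, limit=100):
    """Merge the types of the first `limit` rows into one struct schema."""
    schema = ("struct", {})
    for row in rows[:limit]:
        schema = merge_types(schema, infer_type(row))
    if has_null(schema):
        warnings.warn("some fields are still untyped; try inferring with sampling")
    return schema
```

With `rows = [{"id": None, "tags": []}, {"id": 7, "tags": ["x"]}]`, the first row alone leaves both fields untyped, while merging in the second row resolves `id` to `long` and `tags` to an array of `string`.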
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
      24544fbc
    • ravipesala's avatar
      [SPARK-4207][SQL] Query which has syntax like 'not like' is not working in Spark SQL · 2b6e1ce6
      ravipesala authored
      Queries that use 'not like' do not work in Spark SQL, for example:
      
      sql("SELECT * FROM records where value not like 'val%'")
      
      The same query works in Spark HiveQL.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3075 from ravipesala/SPARK-4207 and squashes the following commits:
      
      35c11e7 [ravipesala] Supported 'not like' syntax in sql
      2b6e1ce6
    • fi's avatar
      [SPARK-4211][Build] Fixes hive.version in Maven profile hive-0.13.1 · df607da0
      fi authored
      The `hive-0.13.1` Maven profile referenced `hive.version=0.13.1` instead of the Spark-compatible `hive.version=0.13.1a`.
      e.g. mvn -Phive -Phive=0.13.1
      
      Note: `hive.version=0.13.1a` is the default property value. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected.
      References:  PR #2685, which resolved a package incompatibility issue with Hive-0.13.1 by introducing a special version Hive-0.13.1a
      
      Author: fi <coderfi@gmail.com>
      
      Closes #3072 from coderfi/master and squashes the following commits:
      
      7ca4b1e [fi] Fixes the `hive-0.13.1` maven profile referencing `hive.version=0.13.1` instead of the Spark compatible `hive.version=0.13.1a` Note: `hive.version=0.13.1a` is the default version. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected. e.g. mvn -Phive -Phive=0.13.1 See PR #2685
      df607da0
    • Xiangrui Meng's avatar
      [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
      
      Note: The new tests are not for this bug fix.
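The offset can be reproduced outside Spark with a short sketch. It assumes one way such an offset can arise: every partition seeds the same generator and skips one draw per preceding partition. The function names are hypothetical, and the per-partition seeding shown as an alternative is an illustration only, not necessarily what this patch does.

```python
import random


def skipped_stream(seed, split, n):
    """Shared seed; partition `split` discards `split` draws before sampling."""
    rng = random.Random(seed)
    for _ in range(split):
        rng.random()
    return [rng.random() for _ in range(n)]


def per_partition_stream(seed, split, n):
    """Hypothetical alternative: derive a distinct seed for each partition."""
    rng = random.Random("%d:%d" % (seed, split))
    return [rng.random() for _ in range(n)]


p0 = skipped_stream(10, 0, 5)
p1 = skipped_stream(10, 1, 5)
# Partition 1 sees the same numbers as partition 0, shifted by one position.
print(p0[1:] == p1[:-1])

q0 = per_partition_stream(10, 0, 5)
q1 = per_partition_stream(10, 1, 5)
# With distinct seeds the two streams no longer line up.
print(q0[1:] == q1[:-1])
```

Using only `random()` draws keeps the shift at exactly one position; mixing `randint` and `random` calls, as in the session above, consumes generator output at different rates.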
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
      3cca1962
    • Nicholas Chammas's avatar
      [EC2] Factor out Mesos spark-ec2 branch · 2aca97c7
      Nicholas Chammas authored
      We reference a specific branch in two places. This patch makes it one place.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the following commits:
      
      10a6089 [Nicholas Chammas] factor out mess spark-ec2 branch
      2aca97c7
    • zsxwing's avatar
      [SPARK-4163][Core][WebUI] Send the fetch failure message back to Web UI · 76386e1a
      zsxwing authored
      This is a PR to send the fetch failure message back to Web UI.
      Before:
      ![f1](https://cloud.githubusercontent.com/assets/1000778/4856595/1f036c80-60be-11e4-956f-335147fbccb7.png)
      ![f2](https://cloud.githubusercontent.com/assets/1000778/4856596/1f11cbea-60be-11e4-8fe9-9f9b2b35c884.png)
      
      After (Please ignore the meaning of exception, I threw it in the code directly because it's hard to simulate a fetch failure):
      ![e1](https://cloud.githubusercontent.com/assets/1000778/4856600/2657ea38-60be-11e4-9f2d-d56c5f900f10.png)
      ![e2](https://cloud.githubusercontent.com/assets/1000778/4856601/26595008-60be-11e4-912b-2744af786991.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3032 from zsxwing/SPARK-4163 and squashes the following commits:
      
      f7e1faf [zsxwing] Discard changes for FetchFailedException and minor modification
      4e946f7 [zsxwing] Add e as the cause of SparkException
      316767d [zsxwing] Add private[storage] to FetchResult
      d51b0b6 [zsxwing] Set e as the cause of FetchFailedException
      b88c919 [zsxwing] Use 'private[storage]' for case classes instead of 'sealed'
      62103fd [zsxwing] Update as per review
      0c07d1f [zsxwing] Backward-compatible support
      a3bca65 [zsxwing] Send the fetch failure message back to Web UI
      76386e1a
    • wangfei's avatar
      [SPARK-4177][Doc]update build doc since JDBC/CLI support hive 13 now · 001acc44
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3042 from scwf/patch-9 and squashes the following commits:
      
      3784ed1 [wangfei] remove 'TODO'
      1891553 [wangfei] update build doc since JDBC/CLI support hive 13
      001acc44
  2. Nov 02, 2014
    • Reynold Xin's avatar
      Close #2971. · d6e4c591
      Reynold Xin authored
      d6e4c591
    • Aaron Davidson's avatar
      [SPARK-4183] Enable NettyBlockTransferService by default · 1ae51f6d
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
      1ae51f6d
    • Joseph K. Bradley's avatar
      [SPARK-3572] [SQL] Internal API for User-Defined Types · ebd64805
      Joseph K. Bradley authored
      This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3063 from marmbrus/udts and squashes the following commits:
      
      7ccfc0d [Michael Armbrust] remove println
      46a3aee [Michael Armbrust] Slightly easier to read test output.
      6cc434d [Michael Armbrust] Recursively convert rows.
      e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts
      15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2
      f3c72fe [Joseph K. Bradley] Fixing merge
      e13cd8a [Joseph K. Bradley] Removed Vector UDTs
      5817b2b [Joseph K. Bradley] style edits
      30ce5b2 [Joseph K. Bradley] updates based on code review
      d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean
      a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs).  Cleaned up other code.  Extended JavaUserDefinedTypeSuite
      6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean
      20630bc [Joseph K. Bradley] fixed scalastyle
      fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now
      8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument.
      8b242ea [Joseph K. Bradley] Fixed merge error after last merge.  Note: Last merge commit also removed SQL UDT examples from mllib.
      7f29656 [Joseph K. Bradley] Moved udt case to top of all matches.  Small cleanups
      b028675 [Xiangrui Meng] allow any type in UDT
      4500d8a [Xiangrui Meng] update example code
      87264a5 [Xiangrui Meng] remove debug code
      3143ac3 [Xiangrui Meng] remove unnecessary changes
      cfbc321 [Xiangrui Meng] support UDT in parquet
      db16139 [Joseph K. Bradley] Added more doc for UserDefinedType.  Removed unused code in Suite
      759af7a [Joseph K. Bradley] Added more doc to UserDefineType
      63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions
      51e5282 [Joseph K. Bradley] fixed 1 test
      f025035 [Joseph K. Bradley] Cleanups before PR.  Added new tests
      85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors.
      dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs
      cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run
      34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL.
      e1f7b9c [Joseph K. Bradley] blah
      2f40c02 [Joseph K. Bradley] renamed UDT types
      3579035 [Joseph K. Bradley] udt annotation now working
      b226b9e [Joseph K. Bradley] Changing UDT to annotation
      fea04af [Joseph K. Bradley] more cleanups
      964b32e [Joseph K. Bradley] some cleanups
      893ee4c [Joseph K. Bradley] udt finallly working
      50f9726 [Joseph K. Bradley] udts
      04303c9 [Joseph K. Bradley] udts
      39f8707 [Joseph K. Bradley] removed old udt suite
      273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done
      8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging
      53de70f [Joseph K. Bradley] more udts...
      982c035 [Joseph K. Bradley] still working on UDTs
      19b2f60 [Joseph K. Bradley] still working on UDTs
      0eaeb81 [Joseph K. Bradley] Still working on UDTs
      105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.
      ebd64805
    • Aaron Davidson's avatar
      [SPARK-4183] Close transport-related resources between SparkContexts · 2ebd1df3
      Aaron Davidson authored
      A leak of event loops may be causing test failures.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3053 from aarondav/leak and squashes the following commits:
      
      e676d18 [Aaron Davidson] Typo!
      8f96475 [Aaron Davidson] Keep original ssc semantics
      7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
      2ebd1df3
    • Cheng Lian's avatar
      [SPARK-2189][SQL] Adds dropTempTable API · 9081b9f9
      Cheng Lian authored
      This PR adds an API for unregistering temporary tables. If a temporary table has been cached before, it's unpersisted as well.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #3039 from liancheng/unregister-temp-table and squashes the following commits:
      
      54ae99f [Cheng Lian] Fixes Scala styling issue
      1948c14 [Cheng Lian] Removes the unpersist argument
      aca41d3 [Cheng Lian] Ensures thread safety
      7d4fb2b [Cheng Lian] Adds unregisterTempTable API
      9081b9f9
    • Yin Huai's avatar
      [SPARK-4185][SQL] JSON schema inference failed when dealing with type conflicts in arrays · 06232d23
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-4185.
      
      This PR also has the fix of #3052.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #3056 from yhuai/SPARK-4185 and squashes the following commits:
      
      ed3a5a8 [Yin Huai] Correctly handle type conflicts between structs and primitive types in an array.
      06232d23
    • wangfei's avatar
      [SPARK-4191][SQL]move wrapperFor to HiveInspectors to reuse it · e749f5de
      wangfei authored
      Move `wrapperFor` from InsertIntoHiveTable to HiveInspectors so that it can be reused when writing data with an ObjectInspector (for example, for ORC support).
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3057 from scwf/reuse-wraperfor and squashes the following commits:
      
      7ccf932 [scwf] fix conflicts
      d44f4da [wangfei] fix imports
      9bf1b50 [wangfei] revert no related change
      9a5276a [wangfei] move wrapfor to hiveinspector to reuse them
      e749f5de
    • Cheng Lian's avatar
      [SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2 · c9f84004
      Cheng Lian authored
      This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
      
      TODO
      
      - [x] Find a general way to figure out Hive (or even any dependency) version.
      
        This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:
      
        1. must apply to both Maven and SBT builds
      
          For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
      
        2. must not rely on the original jars of dependencies to extract a specific dependency version, because Spark uses an assembly jar.
      
          This implies we can't read Hive version from Hive jar files since standard Spark distribution doesn't include them.
      
        3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.
      
           `SPARK_PREPEND_CLASSES` prevents classes from being loaded from the assembly jar, so we can't locate the jar file and read its manifest.
      
        Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
      
      **Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2843 from liancheng/get-info and squashes the following commits:
      
      a873d0f [Cheng Lian] Updates test case
      53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
      1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
      f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
      c9f84004
    • Cheng Lian's avatar
      [SQL] Fixes race condition in CliSuite · 495a1320
      Cheng Lian authored
      `CliSuite` has been flaky for a while. This PR tries to improve the situation by fixing a race condition in `CliSuite`. The `captureOutput` function captures both the stdout and stderr output of the forked external process in two background threads and searches for expected strings, but it wasn't properly synchronized before.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3060 from liancheng/fix-cli-suite and squashes the following commits:
      
      a70569c [Cheng Lian] Fixes race condition in CliSuite
      495a1320
    • Cheng Lian's avatar
      [SPARK-4182][SQL] Fixes ColumnStats classes for boolean, binary and complex data types · e4b80894
      Cheng Lian authored
      `NoopColumnStats` was once used for binary, boolean and complex data types. This `ColumnStats` doesn't return properly shaped column statistics and causes caching failure if a table contains columns of the aforementioned types.
      
      This PR adds `BooleanColumnStats`, `BinaryColumnStats` and `GenericColumnStats`, used for boolean, binary and all complex data types respectively. In addition, `NoopColumnStats` now returns properly shaped column statistics containing null count and row count, but this class is used for testing purposes only.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3059 from liancheng/spark-4182 and squashes the following commits:
      
      b398cfd [Cheng Lian] Fixes failed test case
      fb3ee85 [Cheng Lian] Fixes SPARK-4182
      e4b80894
    • Michael Armbrust's avatar
      [SPARK-3247][SQL] An API for adding data sources to Spark SQL · 9c0eb57c
      Michael Armbrust authored
      This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.
      
      New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data.  BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects.  The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations.
      
      By implementing a class that inherits from RelationProvider, these data sources can be accessed using pure SQL.  I've used the functionality to update the JSON support so it can now be used in this way as follows:
      
      ```sql
      CREATE TEMPORARY TABLE jsonTableSQL
      USING org.apache.spark.sql.json
      OPTIONS (
        path '/home/michael/data.json'
      )
      ```
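The layering described above, a relation that describes its schema plus scan variants that accept progressively more pushed-down work, can be sketched in Python. This is an illustration of the idea only: the actual API is Scala and lives in `org.apache.spark.sql.sources`, and the class names, method names and the `(column, value)` equality-filter format below are assumptions made for the sketch. A plain list of tuples stands in for an RDD of rows.

```python
from abc import ABC, abstractmethod


class BaseRelation(ABC):
    """Describes the schema of an external data source."""

    @property
    @abstractmethod
    def schema(self):
        """Return the column names, in order."""


class TableScan(BaseRelation):
    @abstractmethod
    def build_scan(self):
        """Return every row with every column."""


class PrunedScan(BaseRelation):
    @abstractmethod
    def build_scan(self, required_columns):
        """Return every row, restricted to the requested columns."""


class PrunedFilteredScan(BaseRelation):
    @abstractmethod
    def build_scan(self, required_columns, filters):
        """Return matching rows, restricted to the requested columns."""


class ListRelation(PrunedFilteredScan):
    """Toy source backed by an in-memory list of tuples."""

    def __init__(self, columns, rows):
        self._columns = list(columns)
        self._rows = [tuple(row) for row in rows]

    @property
    def schema(self):
        return list(self._columns)

    def build_scan(self, required_columns, filters):
        # Column pruning: keep only the positions that were asked for.
        keep = [self._columns.index(name) for name in required_columns]
        # Filter push-down: each filter is a (column, value) equality test.
        tests = [(self._columns.index(name), value) for name, value in filters]
        return [
            tuple(row[i] for i in keep)
            for row in self._rows
            if all(row[i] == value for i, value in tests)
        ]
```

For example, `ListRelation(["key", "value"], [(1, "a"), (2, "b")]).build_scan(["value"], [("key", 2)])` reads only the `value` column of the rows whose `key` is 2, so the source never hands back data the query does not need.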
      
      Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
      
      There is also a library that uses this new API to read avro data available here:
      https://github.com/marmbrus/sql-avro
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2475 from marmbrus/foreign and squashes the following commits:
      
      1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      ab2c31f [Michael Armbrust] fix test
      1d41bb5 [Michael Armbrust] unify argument names
      5b47901 [Michael Armbrust] Remove sealed, more filter types
      fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      e3e690e [Michael Armbrust] Add hook for extraStrategies
      a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite
      70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java
      7d948ae [Michael Armbrust] Fix equality of AttributeReference.
      5545491 [Michael Armbrust] Address comments
      5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      22963ef [Michael Armbrust] package objects compile wierdly...
      b069146 [Michael Armbrust] traits => abstract classes
      34f836a [Michael Armbrust] Make @DeveloperApi
      0d74bcf [Michael Armbrust] Add documention on object life cycle
      3e06776 [Michael Armbrust] remove line wraps
      de3b68c [Michael Armbrust] Remove empty file
      360cb30 [Michael Armbrust] style and java api
      2957875 [Michael Armbrust] add override
      0fd3a07 [Michael Armbrust] Draft of data sources API
      9c0eb57c
    • wangfei's avatar
      [HOTFIX][SQL] hive test missing some golden files · f0a4b630
      wangfei authored
      cc marmbrus
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3055 from scwf/hotfix and squashes the following commits:
      
      d881bd7 [wangfei] miss golden files
      f0a4b630
    • zsxwing's avatar
      [SPARK-4166][Core][WebUI] Display the executor ID in the Web UI when ExecutorLostFailure happens · 4e6a7a0b
      zsxwing authored
      Now when ExecutorLostFailure happens, it only displays `ExecutorLostFailure (executor lost)`. Adding the executor id will help locate the faulted executor.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3033 from zsxwing/SPARK-4166 and squashes the following commits:
      
      ff4664c [zsxwing] Backward-compatible support
      c5c4cf2 [zsxwing] Display the executor ID in the Web UI when ExecutorLostFailure happens
      4e6a7a0b
    • Davies Liu's avatar
      [SPARK-3466] Limit size of results that a driver collects for each action · 6181577e
      Davies Liu authored
      Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.
      
      This PR introduces spark.driver.maxResultSize. Once it is set, the driver aborts a job if its result is larger than this limit.
      
      By default, it's 1g (for backward compatibility in most cases).
      
      In local mode, the driver and executor share the same JVM, so the default setting cannot protect the JVM from OOM.
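The kind of check this implies can be sketched as follows. It is a toy model under stated assumptions, not Spark's code: `parse_size`, `ResultCollector` and `ResultSizeExceeded` are hypothetical names, results are modelled as byte strings, and a bare number is assumed to mean bytes.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Turn a size string such as '1g' or '512m' into a number of bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # assumption: a bare number is a byte count


class ResultSizeExceeded(Exception):
    """Raised when the collected results grow past the configured limit."""


class ResultCollector:
    """Accumulates serialized task results and aborts once they get too large."""

    def __init__(self, max_result_size="1g"):
        self.limit = parse_size(max_result_size)
        self.total = 0
        self.results = []

    def add(self, task_result):
        self.total += len(task_result)
        if self.total > self.limit:
            raise ResultSizeExceeded(
                "total result size %d bytes exceeds limit of %d bytes"
                % (self.total, self.limit)
            )
        self.results.append(task_result)
```

Checking the running total as each result arrives lets the job be aborted before the whole result set is held in driver memory, which is the point of having a limit at all.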
      
      cc mateiz
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3003 from davies/collect and squashes the following commits:
      
      248ed5e [Davies Liu] fix compile
      272522e [Davies Liu] address comments
      2c35773 [Davies Liu] add sizes in message of abort()
      5d62303 [Davies Liu] address comments
      bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
      11f97c5 [Davies Liu] address comments
      47b144f [Davies Liu] check the size of result before send and fetch
      3d81af2 [Davies Liu] address comments
      ca8267d [Davies Liu] limit the size of data by collect
      6181577e
  3. Nov 01, 2014
    • Matei Zaharia's avatar
      [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations · 23f966f4
      Matei Zaharia authored
      - Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
      - Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs
      
      This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #2983 from mateiz/decimal-1 and squashes the following commits:
      
      35e6b02 [Matei Zaharia] Fix issues after merge
      227f24a [Matei Zaharia] Review comments
      31f915e [Matei Zaharia] Implement Davies's suggestions in Python
      eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
      4dc6bae [Matei Zaharia] Fix decimal support in PySpark
      d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
      b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
      2118c0d [Matei Zaharia] Some test and bug fixes
      81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
      7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
      ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
      23f966f4
    • Sung Chung's avatar
      [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. · 56f2c61c
      Sung Chung authored
      jkbradley mengxr chouqin Please review this.
      
      Author: Sung Chung <schung@alpinenow.com>
      
      Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:
      
      5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
      56f2c61c
    • Xiangrui Meng's avatar
      [SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading · d8176b1c
      Xiangrui Meng authored
      In #2928 , we shade commons-math3 to prevent future conflicts with hadoop. It caused problems with our Jenkins master build with maven. Some tests used local-cluster mode, where the assembly jar contains relocated math3 classes, while mllib test code still compiles with core and the untouched math3 classes.
      
      This PR sets commons-math3 version based on hadoop profiles.
      
      pwendell JoshRosen srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits:
      
      580f6d9 [Xiangrui Meng] replace tab by spaces
      7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts
      d3353d9 [Xiangrui Meng] do not shade commons-math3
      b4180dc [Xiangrui Meng] temp work
      d8176b1c
    • Patrick Wendell's avatar
      Revert "[SPARK-4183] Enable NettyBlockTransferService by default" · 7894de27
      Patrick Wendell authored
      This reverts commit 59e626c7.
      7894de27
    • Cheng Lian's avatar
      [SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2 · ad0fde10
      Cheng Lian authored
      `HiveThriftServer2` creates a global singleton `SessionState` instance and overrides `HiveContext` to inject the `SessionState` object. This messes up `SessionState` initialization and causes problems.
      
      This PR replaces the global `SessionState` with `HiveContext.sessionState` to avoid the initialization conflict. `HiveContext` also reuses an already started `SessionState`, if any (this is required by `SparkSQLCLIDriver`, which uses a specialized `CliSessionState`).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2887 from liancheng/spark-4037 and squashes the following commits:
      
      8446675 [Cheng Lian] Removes redundant Driver initialization
      a28fef5 [Cheng Lian] Avoid starting HiveContext.sessionState multiple times
      49b1c5b [Cheng Lian] Reuses existing started SessionState if any
      3cd6fab [Cheng Lian] Fixes SPARK-4037
      ad0fde10
    • Aaron Davidson's avatar
      [SPARK-3796] Create external service which can serve shuffle files · f55218ae
      Aaron Davidson authored
      This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).
      
      This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
      
      There are several outstanding tasks which must be completed before this PR can be merged:
      - [x] Complete unit testing of network/shuffle package.
      - [x] Performance and correctness testing on a real cluster.
      - [x] Remove example service instantiation from Worker.scala.
      
      There are even more shortcomings of this PR which should be addressed in followup patches:
      - Don't use Java serializer for RPC layer! It is not cross-version compatible.
      - Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
      - Documentation of the feature in the Spark docs.
      - Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
      - SSL and SASL integration
      - Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3001 from aarondav/shuffle-service and squashes the following commits:
      
      4d1f8c1 [Aaron Davidson] Remove changes to Worker
      705748f [Aaron Davidson] Rename Standalone* to External*
      fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
      9883918 [Aaron Davidson] Make suggested build changes
      3d62679 [Aaron Davidson] Add Spark integration test
      7fe51d5 [Aaron Davidson] Fix SBT integration
      56caa50 [Aaron Davidson] Address comments
      c8d1ac3 [Aaron Davidson] Add unit tests
      2f70c0c [Aaron Davidson] Fix unit tests
      5483e96 [Aaron Davidson] Fix unit tests
      46a70bf [Aaron Davidson] Whoops, bracket
      5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
      f55218ae
    • Xiangrui Meng's avatar
      [SPARK-3569][SQL] Add metadata field to StructField · 1d4f3552
      Xiangrui Meng authored
      Add `metadata: Metadata` to `StructField` to store extra information about columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.
      
      Metadata is preserved through simple operations like `SELECT`.
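      
      To make the "restricted wrapper over a map" idea concrete, here is a minimal sketch in plain Scala. It is not Spark's `Metadata` class: `MiniMetadata`, `Builder`, and every method name below are hypothetical. It omits arrays and JSON parsing, and it does only minimal string escaping.
      
      ```scala
      // Hypothetical sketch; MiniMetadata is NOT Spark's Metadata implementation.
      object MetadataSketch {
        /** Immutable wrapper over a map; only the Builder below can construct one. */
        final class MiniMetadata private[MetadataSketch] (private val entries: Map[String, Any]) {
          def contains(key: String): Boolean = entries.contains(key)
          def getLong(key: String): Long = entries(key).asInstanceOf[Long]
          def getString(key: String): String = entries(key).asInstanceOf[String]

          // Minimal JSON rendering: escapes only backslashes and double quotes.
          private def render(value: Any): String = value match {
            case s: String       => "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
            case m: MiniMetadata => m.toJson
            case other           => other.toString // Boolean, Long, or Double
          }

          /** Serialize with keys in sorted order so the output is deterministic. */
          def toJson: String =
            entries.toSeq.sortBy(_._1)
              .map { case (k, v) => render(k) + ":" + render(v) }
              .mkString("{", ",", "}")
        }

        /** The typed put-methods are the only way in, which restricts the value types. */
        final class Builder {
          private var entries = Map.empty[String, Any]
          private def put(key: String, value: Any): this.type = {
            entries = entries.updated(key, value)
            this
          }
          def putBoolean(key: String, value: Boolean): this.type = put(key, value)
          def putLong(key: String, value: Long): this.type = put(key, value)
          def putDouble(key: String, value: Double): this.type = put(key, value)
          def putString(key: String, value: String): this.type = put(key, value)
          def putMetadata(key: String, value: MiniMetadata): this.type = put(key, value)
          def build(): MiniMetadata = new MiniMetadata(entries)
        }
      }
      ```
      
      For example, `new MetadataSketch.Builder().putLong("max", 120L).putString("unit", "years").build()` yields a value whose `toJson` lists the keys in sorted order.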
      
      marmbrus liancheng
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2701 from mengxr/structfield-metadata and squashes the following commits:
      
      dedda56 [Xiangrui Meng] merge remote
      5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
      886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
      589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      1e2abcf [Xiangrui Meng] change default value of metadata to None in python
      611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
      ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
      3f49aab [Xiangrui Meng] remove StructField.toString
      24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      473a7c5 [Xiangrui Meng] merge master
      c9d7301 [Xiangrui Meng] organize imports
      1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
      60cc131 [Xiangrui Meng] add doc and header
      60614c7 [Xiangrui Meng] add metadata
      e42c452 [Xiangrui Meng] merge master
      93518fb [Xiangrui Meng] support metadata in python
      905bb89 [Xiangrui Meng] java conversions
      618e349 [Xiangrui Meng] make tests work in scala
      61b8e0f [Xiangrui Meng] merge master
      7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
      c41a664 [Xiangrui Meng] merge master
      d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
      67fdebb [Xiangrui Meng] add test on join
      d65072e [Xiangrui Meng] remove Map.empty
      367d237 [Xiangrui Meng] add test
      c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
      1d4f3552
    • Aaron Davidson's avatar
      [SPARK-4183] Enable NettyBlockTransferService by default · 59e626c7
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
      59e626c7
    • Kevin Mader's avatar
      [SPARK-2759][CORE] Generic Binary File Support in Spark · 7136719b
      Kevin Mader authored
      This PR adds the abstract `BinaryFileInputFormat` and `BinaryRecordReader` classes for reading in data as a byte stream and converting it to another format using the `def parseByteArray(inArray: Array[Byte]): T` function.
      As a trivial example, `ByteInputFormat` and `ByteRecordReader` are included, which just return the `Array[Byte]` from a given file.
      Finally, an RDD for `BinaryFileInputFormat` was added (to allow for easier partitioning changes, as was done for WholeFileInput), along with the corresponding `byteFiles` method on `SparkContext`, so the functions can be easily used by others.
      A common use case might be to read in a folder:
      ```
      sc.byteFiles("s3://mydrive/tif/*.tif").map(rawData => ReadTiffFromByteArray(rawData))
      ```
      
      Author: Kevin Mader <kevinmader@gmail.com>
      Author: Kevin Mader <kmader@users.noreply.github.com>
      
      Closes #1658 from kmader/master and squashes the following commits:
      
      3c49a30 [Kevin Mader] fixing wholetextfileinput so it has the same setMinPartitions function as in BinaryData files
      359a096 [Kevin Mader] making the final corrections suggested by @mateiz and renaming a few functions to make their usage clearer
      6379be4 [Kevin Mader] reorganizing code
      7b9d181 [Kevin Mader] removing developer API, cleaning up imports
      8ac288b [Kevin Mader] fixed a single slightly over 100 character line
      92bda0d [Kevin Mader] added new tests, renamed files, fixed several of the javaapi functions, formatted code more nicely
      a32fef7 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
      49174d9 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
      c27a8f1 [Kevin Mader] jenkins crashed before running anything last time, so making minor change
      b348ce1 [Kevin Mader] fixed order in check (prefix only appears on jenkins not when I run unit tests locally)
      0588737 [Kevin Mader] filename check in "binary file input as byte array" test now ignores prefixes and suffixes which might get added by Hadoop
      4163e38 [Kevin Mader] fixing line length and output from FSDataInputStream to DataInputStream to minimize sensitivity to Hadoop API changes
      19812a8 [Kevin Mader] Fixed the serialization issue with PortableDataStream since neither CombineFileSplit nor TaskAttemptContext implement the Serializable interface, by using ByteArrays for storing both and then recreating the objects from these bytearrays as needed.
      238c83c [Kevin Mader] fixed several scala-style issues, changed structure of binaryFiles, removed excessive classes added new tests. The caching tests still have a serialization issue, but that should be easily fixed as well.
      932a206 [Kevin Mader] Update RawFileInput.scala
      a01c9cf [Kevin Mader] Update RawFileInput.scala
      441f79a [Kevin Mader] fixed a few small comments and dependency
      12e7be1 [Kevin Mader] removing imglib from maven (definitely not ready yet)
      5deb79e [Kevin Mader] added new portabledatastream to code so that it can be serialized correctly
      f032bc0 [Kevin Mader] fixed bug in path name, renamed tests
      bc5c0b9 [Kevin Mader] made minor stylistic adjustments from mateiz
      df8e528 [Kevin Mader] fixed line lengths and changed java test
      9a313d5 [Kevin Mader] making classes that needn't be public private, adding automatic file closure, adding new tests
      edf5829 [Kevin Mader] fixing line lengths, adding new lines
      f4841dc [Kevin Mader] un-optimizing imports, silly intellij
      eacfaa6 [Kevin Mader] Added FixedLengthBinaryInputFormat and RecordReader from freeman-lab and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
      1622935 [Kevin Mader] changing the line lengths to make jenkins happy
      1cfa38a [Kevin Mader] added apache headers, added datainputstream directly as an output option for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
      84035f1 [Kevin Mader] adding binary and byte file support spark
      81c5f12 [Kevin Mader] Merge pull request #1 from apache/master
      7136719b
    • luluorta's avatar
      [SPARK-4115][GraphX] Add overrided count for edge counting of EdgeRDD. · ee29ef38
      luluorta authored
      Accumulate sizes of all the EdgePartitions just like the VertexRDD.
      
      Author: luluorta <luluorta@gmail.com>
      
      Closes #2975 from luluorta/graph-edge-count and squashes the following commits:
      
      86ef0e5 [luluorta] Add overrided count for edge counting of EdgeRDD.
      ee29ef38
    • Joseph E. Gonzalez's avatar
      [SPARK-4142][GraphX] Default numEdgePartitions · f4e0b28c
      Joseph E. Gonzalez authored
      Changing the default number of edge partitions to match spark parallelism.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3006 from jegonzal/default_partitions and squashes the following commits:
      
      a9a5c4f [Joseph E. Gonzalez] Changing the default number of edge partitions to match spark parallelism
      f4e0b28c
    • Daniel Lemire's avatar
      Upgrading to roaring 0.4.5 (bug fix release) · 680fd87c
      Daniel Lemire authored
      I recommend upgrading roaring to 0.4.5 as it fixes a rarely occurring bug in iterators (that would otherwise throw an unwarranted exception). The upgrade should have no other consequence.
      
      Author: Daniel Lemire <lemire@gmail.com>
      
      Closes #3044 from lemire/master and squashes the following commits:
      
      54018c5 [Daniel Lemire] Recommended update to roaring 0.4.5 (bug fix release)
      048933e [Daniel Lemire] Merge remote-tracking branch 'upstream/master'
      431f3a0 [Daniel Lemire] Recommended bug fix release
      680fd87c
    • freeman's avatar
      Streaming KMeans [MLLIB][SPARK-3254] · 98c556eb
      freeman authored
      This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.
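      
      As a rough illustration of a decayed update, the sketch below folds one batch into a single cluster. It assumes the old weight is discounted by the decay factor before the batch is mixed in. The names `Cluster` and `update` are hypothetical, and this is not the MLlib code itself.
      
      ```scala
      // Hypothetical sketch of a decayed center update; not the MLlib implementation.
      object DecayedCenterUpdate {
        /** One cluster: its center and the (discounted) weight of the data seen so far. */
        final case class Cluster(center: Vector[Double], weight: Double)

        /**
         * Fold one batch of points assigned to this cluster into its state.
         * decay = 1.0 keeps all history; decay = 0.0 forgets everything before this batch.
         */
        def update(c: Cluster, batch: Seq[Vector[Double]], decay: Double): Cluster = {
          require(decay >= 0.0 && decay <= 1.0, "decay must be in [0, 1]")
          val oldWeight = c.weight * decay
          if (batch.isEmpty) {
            // No new points: the center stays put and only the weight is discounted.
            c.copy(weight = oldWeight)
          } else {
            val newWeight = oldWeight + batch.size
            // Per-dimension sum of the batch points.
            val batchSum = batch.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
            // Weighted average of the old center and the new points.
            val center = c.center.zip(batchSum).map { case (old, s) =>
              (old * oldWeight + s) / newWeight
            }
            Cluster(center, newWeight)
          }
        }
      }
      ```
      
      With `decay = 1.0` this reduces to a running mean over all points seen so far, and with `decay = 0.0` the center jumps to the mean of the latest batch.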
      
      The PR includes:
      - StreamingKMeans algorithm with decay factor settings
      - Usage example
      - Additions to documentation clustering page
      - Unit tests of basic behavior and decay behaviors
      
      tdas mengxr rezazadeh
      
      Author: freeman <the.freeman.lab@gmail.com>
      Author: Jeremy Freeman <the.freeman.lab@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits:
      
      b2e5b4a [freeman] Fixes to docs / examples
      078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
      2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
      0411bf5 [freeman] Change decay parameterization
      9f7aea9 [freeman] Style fixes
      374a706 [freeman] Formatting
      ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
      77dbd3f [freeman] Make initialization check an assertion
      9cfc301 [freeman] Make random seed an argument
      44050a9 [freeman] Simpler constructor
      c7050d5 [freeman] Fix spacing
      2899623 [freeman] Use pattern matching for clarity
      a4a316b [freeman] Use collect
      1472ec5 [freeman] Doc formatting
      ea22ec8 [freeman] Fix imports
      2086bdc [freeman] Log cluster center updates
      ea9877c [freeman] More documentation
      9facbe3 [freeman] Bug fix
      5db7074 [freeman] Example usage for StreamingKMeans
      f33684b [freeman] Add explanation and example to docs
      b5b5f8d [freeman] Add better documentation
      a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
      9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
      b93350f [freeman] Streaming KMeans with decay
      98c556eb
  4. Oct 31, 2014
    • Manish Amde's avatar
      [MLLIB] SPARK-1547: Add Gradient Boosting to MLlib · 86021955
      Manish Amde authored
      Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting, with AdaBoost to follow soon after this PR is accepted. This is based on work done along with hirakendu that was pending due to decision tree optimizations and random forests work.
      
      Ideally, boosting algorithms should work with any base learners.  This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.
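      
      The idea of a pluggable loss can be sketched in a few lines of plain Scala. This is not the MLlib API: the `Loss` trait, `fitStump`, and `boost` are hypothetical names. A one-split stump on a single feature stands in for the decision tree base learner.
      
      ```scala
      // Hypothetical sketch: names and the stump learner are illustrative, not MLlib's API.
      object BoostingSketch {
        /** Boosting only needs a loss's value and its gradient w.r.t. the prediction. */
        trait Loss {
          def error(prediction: Double, label: Double): Double
          def gradient(prediction: Double, label: Double): Double
        }

        object SquaredError extends Loss {
          def error(p: Double, y: Double): Double = (p - y) * (p - y)
          def gradient(p: Double, y: Double): Double = 2.0 * (p - y)
        }

        object AbsoluteError extends Loss {
          def error(p: Double, y: Double): Double = math.abs(p - y)
          def gradient(p: Double, y: Double): Double = math.signum(p - y)
        }

        type Model = Double => Double

        private def mean(xs: Seq[Double]): Double = if (xs.isEmpty) 0.0 else xs.sum / xs.size

        /** Stand-in base learner: one split on a single feature, chosen by squared error. */
        def fitStump(data: Seq[(Double, Double)]): Model = {
          val candidates = data.map(_._1).distinct.map { t =>
            val (left, right) = data.partition(_._1 <= t)
            val (l, r) = (mean(left.map(_._2)), mean(right.map(_._2)))
            val sse = left.map { case (_, y) => (y - l) * (y - l) }.sum +
              right.map { case (_, y) => (y - r) * (y - r) }.sum
            (sse, t, l, r)
          }
          // Keep the first candidate with the smallest squared error.
          val (_, t, l, r) = candidates.reduce((a, b) => if (b._1 < a._1) b else a)
          x => if (x <= t) l else r
        }

        /** Additive model built from `rounds` base learners, each scaled by `learningRate`. */
        def boost(data: Seq[(Double, Double)], loss: Loss, rounds: Int, learningRate: Double): Model = {
          require(data.nonEmpty, "need at least one training point")
          val zero: Model = _ => 0.0
          (1 to rounds).foldLeft(zero) { (model, _) =>
            // Each round fits the base learner to the negative gradient of the loss.
            val pseudoResiduals = data.map { case (x, y) => (x, -loss.gradient(model(x), y)) }
            val learner = fitStump(pseudoResiduals)
            x => model(x) + learningRate * learner(x)
          }
        }
      }
      ```
      
      Swapping `SquaredError` for `AbsoluteError` changes only the targets the base learner is fit to. The boosting loop itself is untouched, which is the point of keeping the loss pluggable.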
      
      Here is the task list:
      - [x] Gradient boosting support
      - [x] Pluggable loss functions
      - [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest.
      - [x] Binary classification support
      - [x] Support configurable checkpointing – This approach will avoid long lineage chains.
      - [x] Create classification and regression APIs
      - [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting.
      - [x] Unit Tests
      
      Future work:
      + Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
      + BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed.
      
      cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      
      Closes #2607 from manishamde/gbt and squashes the following commits:
      
      991c7b5 [Manish Amde] public api
      ff2a796 [Manish Amde] addressing comments
      b4c1318 [Manish Amde] removing spaces
      8476b6b [Manish Amde] fixing line length
      0183cb9 [Manish Amde] fixed naming and formatting issues
      1c40c33 [Manish Amde] add newline, removed spaces
      e33ab61 [Manish Amde] minor comment
      eadbf09 [Manish Amde] parameter renaming
      035a2ed [Manish Amde] jkbradley formatting suggestions
      9f7359d [Manish Amde] simplified gbt logic and added more tests
      49ba107 [Manish Amde] merged from master
      eff21fe [Manish Amde] Added gradient boosting tests
      3fd0528 [Manish Amde] moved helper methods to new class
      a32a5ab [Manish Amde] added test for subsampling without replacement
      781542a [Manish Amde] added support for fractional subsampling with replacement
      3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to its own test suite
      0e81906 [Manish Amde] improving caching unpersisting logic
      d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class
      fee06d3 [Manish Amde] added weighted ensemble model
      1b01943 [Manish Amde] add weights for base learners
      9bc6e74 [Manish Amde] adding random seed as parameter
      d2c8323 [Manish Amde] Merge branch 'master' into gbt
      2ae97b7 [Manish Amde] added documentation for the loss classes
      9366b8f [Manish Amde] minor: using numTrees instead of trees.size
      3b43896 [Manish Amde] added learning rate for prediction
      9b2e35e [Manish Amde] Merge branch 'master' into gbt
      6a11c02 [manishamde] fixing formatting
      823691b [Manish Amde] fixing RF test
      1f47941 [Manish Amde] changing access modifier
      5b67102 [Manish Amde] shortened parameter list
      5ab3796 [Manish Amde] minor reformatting
      9155a9d [Manish Amde] consolidated boosting configuration and added public API
      631baea [Manish Amde] Merge branch 'master' into gbt
      2cb1258 [Manish Amde] public API support
      3b8ffc0 [Manish Amde] added documentation
      8e10c63 [Manish Amde] modified unpersist strategy
      f62bc48 [Manish Amde] added unpersist
      bdca43a [Manish Amde] added timing parameters
      2fbc9c7 [Manish Amde] fixing binomial classification prediction
      6dd4dd8 [Manish Amde] added support for log loss
      9af0231 [Manish Amde] classification attempt
      62cc000 [Manish Amde] basic checkpointing
      4784091 [Manish Amde] formatting
      78ed452 [Manish Amde] added newline and fixed if statement
      3973dd1 [Manish Amde] minor indicating subsample is double during comparison
      aa8fae7 [Manish Amde] minor refactoring
      1a8031c [Manish Amde] sampling with replacement
      f1c9ef7 [Manish Amde] Merge branch 'master' into gbt
      cdceeef [Manish Amde] added documentation
      6251fd5 [Manish Amde] modified method name
      5538521 [Manish Amde] disable checkpointing for now
      0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches
      86021955
    • Anant's avatar
      [SPARK-3838][examples][mllib][python] Word2Vec example in python · e07fb6a4
      Anant authored
      This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838
      
      Python example for word2vec
      mengxr
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #2952 from anantasty/SPARK-3838 and squashes the following commits:
      
      87bd723 [Anant] remove stop line
      4bd439e [Anant] Changes as per code review. Fixed error in word2vec python example, simplified example in docs.
      3d3c9ee [Anant] Added empty line after python imports
      0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
      ee4f5f6 [Anant] Fixes from code review comments
      c637bcf [Anant] Added word2vec python example to docs
      269f31f [Anant] added example in docs
      c015b14 [Anant] Added python example for word2vec
      e07fb6a4
    • Alexander Ulanov's avatar
      [MLLIB] SPARK-2329 Add multi-label evaluation metrics · 62d01d25
      Alexander Ulanov authored
      Implementation of various multi-label classification measures, including Hamming loss, strict and default accuracy, macro-averaged precision, recall and F1-measure based on documents and labels, and micro-averaged measures. See https://issues.apache.org/jira/browse/SPARK-2329
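      
      For readers unfamiliar with these measures, the sketch below computes a few of them from (predicted labels, true labels) pairs using their standard textbook definitions. The object and method names are hypothetical and do not reflect the class added by this PR. Returning 0.0 when a denominator is zero is an assumed convention.
      
      ```scala
      // Hypothetical sketch using textbook definitions; not the MLlib class from this PR.
      object MultilabelSketch {
        type Labels = Set[Int]

        /** Fraction of (document, label) slots that were predicted wrongly. */
        def hammingLoss(docs: Seq[(Labels, Labels)], numLabels: Int): Double = {
          require(docs.nonEmpty && numLabels > 0, "need documents and a positive label count")
          val wrong = docs.map { case (pred, truth) =>
            pred.diff(truth).size + truth.diff(pred).size
          }.sum
          wrong.toDouble / (docs.size.toLong * numLabels)
        }

        /** Strict (subset) accuracy: a document counts only if its whole label set matches. */
        def strictAccuracy(docs: Seq[(Labels, Labels)]): Double = {
          require(docs.nonEmpty, "need at least one document")
          docs.count { case (pred, truth) => pred == truth }.toDouble / docs.size
        }

        /** Micro-averaged precision: true positives pooled over all documents. */
        def microPrecision(docs: Seq[(Labels, Labels)]): Double = {
          val tp = docs.map { case (pred, truth) => pred.intersect(truth).size }.sum
          val predicted = docs.map(_._1.size).sum
          if (predicted == 0) 0.0 else tp.toDouble / predicted
        }

        /** Micro-averaged recall: pooled true positives over all true labels. */
        def microRecall(docs: Seq[(Labels, Labels)]): Double = {
          val tp = docs.map { case (pred, truth) => pred.intersect(truth).size }.sum
          val actual = docs.map(_._2.size).sum
          if (actual == 0) 0.0 else tp.toDouble / actual
        }
      }
      ```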
      
      Multi-class measures are currently in the following pull request: https://github.com/apache/spark/pull/1155
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      Author: avulanov <nashb@yandex.ru>
      
      Closes #1270 from avulanov/multilabelmetrics and squashes the following commits:
      
      fc8175e [Alexander Ulanov] Merge with previous updates
      43a613e [Alexander Ulanov] Addressing reviewers comments: change Set to Array
      517a594 [avulanov] Addressing reviewers comments: Scala style
      cf4222bc [avulanov] Addressing reviewers comments: renaming. Added label method that returns the list of labels
      1843f73 [Alexander Ulanov] Scala style fix
      79e8476 [Alexander Ulanov] Replacing fold(_ + _) with sum as suggested by srowen
      ca46765 [Alexander Ulanov] Cosmetic changes: Apache header and parameter explanation
      40593f5 [Alexander Ulanov] Multi-label metrics: Hamming-loss, strict and normal accuracy, fix to macro measures, bunch of tests
      ad62df0 [Alexander Ulanov] Comments and scala style check
      154164b [Alexander Ulanov] Multilabel evaluation metrics and tests: macro precision and recall averaged by docs, micro and per-class precision and recall averaged by class
      62d01d25
    • Sandy Ryza's avatar
      SPARK-4175. Exception on stage page · 23f73f52
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3043 from sryza/sandy-spark-4175 and squashes the following commits:
      
      e327340 [Sandy Ryza] SPARK-4175. Exception on stage page
      23f73f52
    • andrewor14's avatar
      [HOT FIX] Yarn stable tests don't compile · 087e31a7
      andrewor14 authored
      This is caused by this commit: acd4ac7c
      
      Author: andrewor14 <andrew@databricks.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3041 from andrewor14/yarn-hot-fix and squashes the following commits:
      
      e5deba1 [andrewor14] Add new line at the end (minor)
      aa998e8 [Andrew Or] Compilation hot fix
      087e31a7