  1. Jul 08, 2015
    • [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility · 4ffc27ca
      Cheng Lian authored
      
      This PR is a follow-up to #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. This one fixes the read path: Spark SQL is now expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
      
      ### Major changes
      
      1. `CatalystConverter` class hierarchy refactoring
      
         - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
      
           Now, instead of extending the original `CatalystConverter` trait, every converter class accepts an updater that is responsible for propagating the converted value to some parent container: appending array elements to a parent array buffer, appending key-value pairs to a parent mutable map, or setting a converted value on a specific field of a parent row. The root converter doesn't have a parent and thus uses a `NoopUpdater` (a minimal sketch of this pattern appears after this list).
      
           This simplifies the design since converters don't need to care about details of their parent converters anymore.
      
         - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
      
           Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
      
         - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
      
           `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
      
           The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
      
         - Implements backwards-compatibility rules in `CatalystArrayConverter`
      
           When Parquet records are being converted, the schema of the Parquet file should already have been verified, so we only need to care about the structure rather than the field names in the Parquet schema. Since all map objects written by legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
      
      2. Requested columns handling
      
         When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` containing all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a physical structure different from the converted schema.
      
         In this PR, the schema for requested columns is constructed using the following method:
      
         - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
         - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
         - Finally, we union all single-field `MessageType`s into a full schema containing all requested fields.
      
         With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-file.
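
      Below is a minimal sketch of the updater pattern described in change 1. The types are simplified and hypothetical; the real converter classes also implement Parquet's `Converter` API, which is elided here.

      ```scala
      import scala.collection.mutable.ArrayBuffer

      trait ParentContainerUpdater {
        def set(value: Any): Unit
      }

      // The root converter has no parent, so its updater is a no-op.
      object NoopUpdater extends ParentContainerUpdater {
        override def set(value: Any): Unit = ()
      }

      // An array converter hands child converters an updater that appends
      // converted elements to its own buffer; children never need to know
      // whether their parent is a row, an array, or a map.
      class SketchArrayConverter {
        private val buffer = ArrayBuffer.empty[Any]

        val elementUpdater: ParentContainerUpdater = new ParentContainerUpdater {
          override def set(value: Any): Unit = buffer += value
        }

        def currentArray: Seq[Any] = buffer.toSeq
      }
      ```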
      
      ### Testing
      
      This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build-time code generation and extra complexity in the build system, the Java code generated from the testing Thrift schema and Avro IDL is also checked in.
      
      [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
      [2]: https://issues.apache.org/jira/browse/SPARK-6774
      [3]: https://issues.apache.org/jira/browse/SPARK-6123
      [4]: https://issues.apache.org/jira/browse/SPARK-8848
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7231 from liancheng/spark-6776 and squashes the following commits:
      
      360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
      c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
      b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
      598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
      926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
      7946ee1 [Cheng Lian] Fixes Scala styling issues
      3d7ab36 [Cheng Lian] Fixes .rat-excludes
      a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
      f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
      1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
      440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
      13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
      06cfe9d [Cheng Lian] Adds comments about TimestampType handling
      a099d3e [Cheng Lian] More comments
      0cc1b37 [Cheng Lian] Fixes MiMa checks
      884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
      802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
      38fe1e7 [Cheng Lian] Adds explicit return type
      7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
      1781dff [Cheng Lian] Adds test case for SPARK-8811
      6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
      bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
      a74fb2c [Cheng Lian] More comments
      0525346 [Cheng Lian] Removes old Parquet record converters
      03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
    • [SPARK-8902] Correctly print hostname in error · 5687f765
      Daniel Darabos authored
      With "+", the strings are separate expressions, and format() is called only on the last string before concatenation, so the substitution does not happen. Without "+", the string literals are merged first by the parser, so format() is called on the complete string.
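
      For illustration, here is the same precedence pitfall in Scala terms. (The fixed code is Python, where the fix relies on the parser merging adjacent string literals; the message text and host name here are made up.)

      ```scala
      val host = "worker-1"

      // Bug: the method call binds tighter than '+', so format() applies only
      // to the second literal, and the %s in the first is never substituted.
      val bad = "Failed to connect to host %s. " + "Check the master URL.".format(host)

      // Grouping the concatenation first lets format() see the whole string:
      val good = ("Failed to connect to host %s. " + "Check the master URL.").format(host)
      ```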
      
      Should I make a JIRA for this?
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #7288 from darabos/patch-2 and squashes the following commits:
      
      be0d3b7 [Daniel Darabos] Correctly print hostname in error
    • [SPARK-8700][ML] Disable feature scaling in Logistic Regression · 57221934
      DB Tsai authored
      All compressed sensing applications, and some regression use cases, get better results with feature scaling turned off. However, a naive implementation that trains on the unstandardized dataset converges poorly. Instead, we can still standardize the training dataset internally but penalize each component differently, yielding effectively the same objective function with a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this adjustment, all features are standardized and therefore penalized equally.
      
      In R, there is an option for this:

      `standardize`: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
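
      A sketch of the equivalence being described, writing sigma_j for the standard deviation of feature j: training on the standardized features x_j / sigma_j yields coefficients w'_j = sigma_j * w_j, so the penalty on the original scale can be recovered by reweighting each component:

      ```latex
      \lambda \sum_j |w_j| \;=\; \lambda \sum_j \frac{|w'_j|}{\sigma_j},
      \qquad w'_j = \sigma_j w_j
      ```

      Columns with high variance thus receive a smaller effective penalty in the standardized problem, and vice versa.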
      
      +cc holdenk mengxr jkbradley
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #7080 from dbtsai/lors and squashes the following commits:
      
      877e6c7 [DB Tsai] repahse the doc
      7cf45f2 [DB Tsai] address feedback
      78d75c9 [DB Tsai] small change
      c2c9e60 [DB Tsai] style
      6e1a8e0 [DB Tsai] first commit
    • [SPARK-8908] [SQL] Add () to distinct definition in dataframe · 00b265f1
      Cheolsoo Park authored
      Adding `()` to the definition of `distinct` in DataFrame allows distinct to be called with parentheses, which is consistent with `dropDuplicates`.
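
      Roughly, the change looks like this (a sketch with a hypothetical `Frame` class, not the actual DataFrame source):

      ```scala
      class Frame {
        def dropDuplicates(): Frame = this  // stand-in implementation

        // Before: `def distinct: Frame` could not be called as `df.distinct()`.
        // After: with `()`, both `df.distinct` and `df.distinct()` compile in
        // Scala 2, consistent with `dropDuplicates()`.
        def distinct(): Frame = dropDuplicates()
      }
      ```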
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7298 from piaozhexiu/SPARK-8908 and squashes the following commits:
      
      7f0d923 [Cheolsoo Park] Add () to distinct definition in dataframe
    • [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version · 8f3cd932
      Alok Singh authored
      
      Author: Alok Singh <“singhal@us.ibm.com”>
      
      Closes #7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits:
      
      d3c20ba [Alok Singh] fix the file to .parquet from .json
      d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
    • [SPARK-8457] [ML] NGram Documentation · c5532e2f
      Feynman Liang authored
      Add documentation for NGram feature transformer.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7244 from feynmanliang/SPARK-8457 and squashes the following commits:
      
      5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab
      60d5ac0 [Feynman Liang] Inline API doc and fix indentation
      736ccbc [Feynman Liang] NGram feature transformer documentation
    • [SPARK-8783] [SQL] CTAS with WITH clause does not work · f0315437
      Keuntae Park authored
      Currently, CTESubstitution only handles the case where WITH is at the top of the plan.
      It should also handle the case where WITH is a child of CTAS.
      This patch simply changes 'match' to 'transform' to search the plan recursively for WITH (see the sketch below).
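
      A minimal sketch of the difference, using hypothetical, simplified plan nodes: a top-level `match` inspects only the root of the plan, while `transform` rewrites matching nodes anywhere in the tree.

      ```scala
      sealed trait Plan
      case class With(child: Plan, ctes: Map[String, Plan]) extends Plan
      case class CreateTableAsSelect(query: Plan) extends Plan
      case class Relation(name: String) extends Plan

      def transform(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
        val withNewChildren = plan match {
          case With(c, ctes)          => With(transform(c)(rule), ctes)
          case CreateTableAsSelect(q) => CreateTableAsSelect(transform(q)(rule))
          case leaf                   => leaf
        }
        rule.applyOrElse(withNewChildren, identity[Plan])
      }

      // The WITH node nested under CTAS is still found and rewritten:
      val plan      = CreateTableAsSelect(With(Relation("t"), Map.empty))
      val rewritten = transform(plan) { case With(child, _) => child }
      // rewritten == CreateTableAsSelect(Relation("t"))
      ```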
      
      Author: Keuntae Park <sirpkt@apache.org>
      
      Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits:
      
      e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH
      1671c77 [Keuntae Park] WITH clause can be inside CTAS
    • [SPARK-7785] [MLLIB] [PYSPARK] Add __str__ and __repr__ to Matrices · 2b40365d
      MechCoder authored
      Adds __str__ and __repr__ to DenseMatrix and SparseMatrix.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6342 from MechCoder/spark-7785 and squashes the following commits:
      
      7b9a82c [MechCoder] Add tests for greater than 16 elements
      b88e9dd [MechCoder] Increment limit to 16
      1425a01 [MechCoder] Change tests
      36bd166 [MechCoder] Change str and repr representation
      97f0da9 [MechCoder] zip is same as izip in python3
      94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs
      b26fa89 [MechCoder] minor
      394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
    • [SPARK-8900] [SPARKR] Fix sparkPackages in init documentation · 374c8a8a
      Shivaram Venkataraman authored
      cc pwendell
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7293 from shivaram/sparkr-packages-doc and squashes the following commits:
      
      c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
    • [SPARK-8657] [YARN] Fail to upload resource to viewfs · 26d9b6b8
      Tao Li authored
      Fixes the failure to upload resources to viewfs in Spark 1.4.
      JIRA Link: https://issues.apache.org/jira/browse/SPARK-8657
      
      Author: Tao Li <litao@sogou-inc.com>
      
      Closes #7125 from litao-buptsse/SPARK-8657-for-master and squashes the following commits:
      
      65b13f4 [Tao Li] [SPARK-8657] [YARN] Fail to upload resource to viewfs
    • [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer. · f61c989b
      Reynold Xin authored
      Just a baby step towards making it more efficient.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7282 from rxin/SPARK-8888 and squashes the following commits:
      
      3da51ae [Reynold Xin] [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.
    • [SPARK-8753][SQL] Create an IntervalType data type · 0ba98c04
      Wenchen Fan authored
      We need a new data type to represent time intervals. Because we can't determine how many days are in a month, we need two values for an interval: an int `months` and a long `microseconds`.
      
      The interval literal syntax looks like:
      `interval 3 years -4 month 4 weeks 3 second`
      
      Because we use the number of 100ns units as the value of `TimestampType`, it may not make sense to support a nanosecond unit.
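
      A minimal sketch of the two-field representation (hypothetical names; the actual class in Spark may differ):

      ```scala
      case class Interval(months: Int, microseconds: Long) {
        def add(that: Interval): Interval =
          Interval(months + that.months, microseconds + that.microseconds)
      }

      // "interval 3 years -4 month" contributes 3 * 12 - 4 = 32 months, while
      // "4 weeks 3 second" contributes (4 * 7 * 24 * 3600 + 3) seconds in microseconds:
      val example = Interval(3 * 12 - 4, (4L * 7 * 24 * 3600 + 3) * 1000000L)
      ```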
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7226 from cloud-fan/interval and squashes the following commits:
      
      632062d [Wenchen Fan] address comments
      ac348c3 [Wenchen Fan] use case class
      0342d2e [Wenchen Fan] use array byte
      df9256c [Wenchen Fan] fix style
      fd6f18a [Wenchen Fan] address comments
      1856af3 [Wenchen Fan] support interval type
    • [SPARK-5707] [SQL] fix serialization of generated projection · 74335b31
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7272 from davies/fix_projection and squashes the following commits:
      
      075ef76 [Davies Liu] fix codegen with BroadcastHashJion
    • [SPARK-6912] [SQL] Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF · 3e831a26
      Takeshi YAMAMURO authored
      To make the error understandable to UDF developers, throw an exception when unsupported Map<K,V> types are used in a Hive UDF. This fix takes the same approach as #7248.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits:
      
      916099a [Takeshi YAMAMURO] Fix style errors
      7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
    • [SPARK-8785] [SQL] Improve Parquet schema merging · 6722aca8
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8785
      
      Currently, Parquet schema merging (`ParquetRelation2.readSchema`) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those instead (see the sketch below).
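
      A sketch of the idea (not the actual `ParquetRelation2` code): merging N identical schemas repeats the same work N times, so deduplicate before folding.

      ```scala
      // Generic over the schema type S; `merge` is the pairwise schema merge.
      def mergeAll[S](schemas: Seq[S])(merge: (S, S) => S): S =
        schemas.distinct.reduceLeft(merge)

      // Thousands of part-files typically share a handful of distinct schemas,
      // so only those few get merged:
      val merged = mergeAll(Seq("a", "a", "a", "b"))(_ + _)  // merges just "a" and "b"
      ```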
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7182 from viirya/improve_parquet_merging and squashes the following commits:
      
      5cf934f [Liang-Chi Hsieh] Refactor it to make it faster.
      f3411ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into improve_parquet_merging
      a63c3ff [Liang-Chi Hsieh] Improve Parquet schema merging.
    • [SPARK-8894] [SPARKR] [DOC] Example code errors in SparkR documentation. · bf02e377
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #7287 from sun-rui/SPARK-8894 and squashes the following commits:
      
      da63898 [Sun Rui] [SPARK-8894][SPARKR][DOC] Example code errors in SparkR documentation.
    • [SPARK-8872] [MLLIB] added verification results from R for FPGrowthSuite · 3bb21775
      Kashif Rasul authored
      Author: Kashif Rasul <kashif.rasul@gmail.com>
      
      Closes #7269 from kashif/SPARK-8872 and squashes the following commits:
      
      2d5457f [Kashif Rasul] added R code for FP Int type
      3de6808 [Kashif Rasul] added verification results from R for FPGrowthSuite
    • [SPARK-7050] [BUILD] Fix Python Kafka test assembly jar not found issue under Maven build · 8a9d9cc1
      jerryshao authored
      Fixes the Spark Streaming unit tests under the Maven build. Previously, the name and path of the Maven-generated jar differed from sbt's, which led to the following exception. This fix keeps the behavior the same for both the Maven and sbt builds.
      
      ```
      Failed to find Spark Streaming Kafka assembly jar in /home/xyz/spark/external/kafka-assembly
      You need to build Spark with  'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or 'build/mvn package' before running this program
      ```
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5632 from jerryshao/SPARK-7050 and squashes the following commits:
      
      74b068d [jerryshao] Fix mvn build issue
    • [SPARK-8883][SQL] Remove the OverrideFunctionRegistry · 351a36d0
      Cheng Hao authored
      Remove `OverrideFunctionRegistry` from Spark SQL, as the subclasses of `FunctionRegistry` have their own way to delegate to the right underlying `FunctionRegistry`.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #7260 from chenghao-intel/override and squashes the following commits:
      
      164d093 [Cheng Hao] enable the function registry
      2ca8459 [Cheng Hao] remove the OverrideFunctionRegistry
    • [SPARK-8886][Documentation] python Style update · 08192a1b
      Tijo Thomas authored
      Fixed comment given by rxin
      
      Author: Tijo Thomas <tijoparacka@gmail.com>
      
      Closes #7281 from tijoparacka/modification_for_python_style and squashes the following commits:
      
      6334e21 [Tijo Thomas] removed space
      3de4cd8 [Tijo Thomas] python Style update
    • [SPARK-8879][SQL] Remove EmptyRow class. · 61c3cf79
      Reynold Xin authored
      As a baby step towards no megamorphic InternalRow.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7277 from rxin/remove-empty-row and squashes the following commits:
      
      594100e [Reynold Xin] [SPARK-8879][SQL] Remove EmptyRow class.
  2. Jul 07, 2015
    • [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions. · 5d603dfe
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7273 from rxin/bitwise-unittest and squashes the following commits:
      
      60c5667 [Reynold Xin] [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.
    • [SPARK-8868] SqlSerializer2 can go into infinite loop when row consists only of NullType columns · 68a4a169
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-8868
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7262 from yhuai/SPARK-8868 and squashes the following commits:
      
      cb58780 [Yin Huai] Andrew's comment.
      e456857 [Yin Huai] Josh's comments.
      5122e65 [Yin Huai] If types of all columns are NullTypes, do not use serializer2.
    • [SPARK-7190] [SPARK-8804] [SPARK-7815] [SQL] unsafe UTF8String · 4ca90935
      Davies Liu authored
      Lets UTF8String work with a binary buffer. Until we have a better idea of how to manage the lifecycle of UTF8String in Row, we still copy when calling `UnsafeRow.get()` for StringType (see the sketch below).
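
      A minimal sketch of the copy-on-read behavior described (hypothetical names; not Spark's actual UTF8String):

      ```scala
      import java.nio.charset.StandardCharsets

      final class SimpleUTF8String(bytes: Array[Byte]) {
        override def toString: String = new String(bytes, StandardCharsets.UTF_8)
      }

      // Reading a StringType field copies the backing bytes, so the returned
      // value stays valid even if the row's buffer is later reused.
      def getString(buffer: Array[Byte], offset: Int, numBytes: Int): SimpleUTF8String = {
        val copy = new Array[Byte](numBytes)
        System.arraycopy(buffer, offset, copy, 0, numBytes)
        new SimpleUTF8String(copy)
      }
      ```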
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7197 from davies/unsafe_string and squashes the following commits:
      
      51b0ea0 [Davies Liu] fix test
      50c1ebf [Davies Liu] remove optimization for upper/lower case
      315d491 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string
      93fce17 [Davies Liu] address comment
      e9ff7ba [Davies Liu] clean up
      67ec266 [Davies Liu] fix bug
      7b74b1f [Davies Liu] fallback to String if local dependent
      ab7857c [Davies Liu] address comments
      7da92f5 [Davies Liu] handle local in toUpperCase/toLowerCase
      59dbb23 [Davies Liu] revert python change
      d1e0716 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string
      002e35f [Davies Liu] rollback hashCode change
      a87b7a8 [Davies Liu] improve toLowerCase and toUpperCase
      76e794a [Davies Liu] fix test
      8b2d5ce [Davies Liu] fix tests
      fd3f0a6 [Davies Liu] bug fix
      c4e9c88 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string
      c45d921 [Davies Liu] address comments
      175405f [Davies Liu] unsafe UTF8String
    • [SPARK-8876][SQL] Remove InternalRow type alias in expressions package. · 770ff102
      Reynold Xin authored
      The type alias was there because, when I initially moved Row around, I didn't want to make massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7270 from rxin/internalrow and squashes the following commits:
      
      72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
    • [SPARK-8794] [SQL] Make PrunedScan work for Sample · da56c4e7
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8794
      
      Currently, `PrunedScan` works only when directly followed by project or filter operations. However, even when a `Sample` sits between those operations and the `PrunedScan`, `PrunedScan` should still work.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #7228 from viirya/sample_prunedscan and squashes the following commits:
      
      ede7cd8 [Liang-Chi Hsieh] Keep PrunedScanSuite untouched.
      6f05d30 [Liang-Chi Hsieh] Move unit test to FilterPushdownSuite.
      5f32473 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
      7e4ba76 [Liang-Chi Hsieh] Use Optimzier for push down projection and filter.
      0686830 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
      df82785 [Liang-Chi Hsieh] Make PrunedScan work on Sample.
    • [SPARK-8845] [ML] ML use of Breeze optimization: use adjustedValue instead of value · 3bf20c27
      DB Tsai authored
      In LinearRegression and LogisticRegression, we use Breeze's optimizers (LBFGS and OWLQN). We check the State.value to see the current objective. However, Breeze's documentation makes it sound like value and adjustedValue differ for some optimizers, possibly including OWLQN: https://github.com/scalanlp/breeze/blob/26faf622862e8d7a42a401aef601347aac655f2b/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala#L36
      If that is the case, then we should use adjustedValue instead of value. This is relevant to SPARK-8538 and SPARK-8539, where we will provide the objective trace to the user.
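
      A sketch of reading `adjustedValue` from the optimizer states (assuming Breeze's `LBFGS` and `State` API as linked above):

      ```scala
      import breeze.linalg.DenseVector
      import breeze.optimize.{DiffFunction, LBFGS}

      // f(x) = ||x||^2 with its gradient, as a Breeze DiffFunction.
      val f = new DiffFunction[DenseVector[Double]] {
        def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) =
          (x dot x, x * 2.0)
      }

      val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 5)

      // Report adjustedValue rather than value: for regularized optimizers such
      // as OWLQN the two can differ, and adjustedValue reflects the objective
      // actually being minimized.
      val objectiveTrace = lbfgs.iterations(f, DenseVector.zeros[Double](3))
        .map(_.adjustedValue)
        .toArray
      ```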
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #7245 from dbtsai/SPARK-8845 and squashes the following commits:
      
      fa4c91e [DB Tsai] address feedback
      e6caac1 [DB Tsai] java style multiline comment
      b10c574 [DB Tsai] address feedback
      c9ff81e [DB Tsai] first commit
    • [SPARK-8704] [ML] [PySpark] Add missing methods in StandardScaler · 35d781e7
      MechCoder authored
      Adds:
      - std and mean to StandardScalerModel
      - getVectors and findSynonyms to Word2VecModel
      - setFeatures and getFeatures to HashingTF
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:
      
      9fbae90 [MechCoder] Add type
      6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
    • [SPARK-8559] [MLLIB] Support Association Rule Generation · 3336c7b1
      Feynman Liang authored
      Distributed generation of single-consequent association rules from an RDD of frequent itemsets. Tests are checked against R's implementation of Apriori in [arules](http://cran.r-project.org/web/packages/arules/index.html).
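
      For reference, a single-consequent rule has the form X => {y}, and its confidence is computed from itemset frequencies; only rules meeting the minimum confidence are kept:

      ```latex
      \mathrm{confidence}(X \Rightarrow \{y\})
        = \frac{\mathrm{freq}(X \cup \{y\})}{\mathrm{freq}(X)}
        \ge \mathrm{minConfidence}
      ```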
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7005 from feynmanliang/fp-association-rules-distributed and squashes the following commits:
      
      466ced0 [Feynman Liang] Refactor AR generation impl
      73c1cff [Feynman Liang] Make rule attributes public, remove numTransactions from FreqItemset
      80f63ff [Feynman Liang] Change default confidence and optimize imports
      04cf5b5 [Feynman Liang] Code review with @mengxr, add R to tests
      0cc1a6a [Feynman Liang] Java compatibility test
      f3c14b5 [Feynman Liang] Fix MiMa test
      764375e [Feynman Liang] Fix tests
      1187307 [Feynman Liang] Almost working tests
      b20779b [Feynman Liang] Working implementation
      5395c4e [Feynman Liang] Fix imports
      2d34405 [Feynman Liang] Partial implementation of distributed ar
      83ace4b [Feynman Liang] Local rule generation without pruning complete
      69c2c87 [Feynman Liang] Working local implementation, now to parallelize../..
      4e1ec9a [Feynman Liang] Pull FreqItemsets out, refactor type param, tests
      69ccedc [Feynman Liang] First implementation of association rule generation
    • [SPARK-8821] [EC2] Switched to binary mode for file reading · 70beb808
      Simon Hafner authored
      
      Otherwise the script will crash with
      
          - Downloading boto...
          Traceback (most recent call last):
            File "ec2/spark_ec2.py", line 148, in <module>
              setup_external_libs(external_libs)
            File "ec2/spark_ec2.py", line 128, in setup_external_libs
              if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
            File "/usr/lib/python3.4/codecs.py", line 319, in decode
              (result, consumed) = self._buffer_decode(data, self.errors, final)
          UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
      
      This happens when a UTF-8 locale is set in the environment.
      
      Author: Simon Hafner <hafnersimon@gmail.com>
      
      Closes #7215 from reactormonk/branch-1.4 and squashes the following commits:
      
      e86957a [Simon Hafner] [SPARK-8821] [EC2] Switched to binary mode
      
      (cherry picked from commit 83a621a5)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
    • [SPARK-8823] [MLLIB] [PYSPARK] Optimizations for SparseVector dot products · 738c1074
      MechCoder authored
      Follow up for https://github.com/apache/spark/pull/5946
      
      Currently we iterate over the indices and values of a SparseVector; this can be vectorized.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7222 from MechCoder/sparse_optim and squashes the following commits:
      
      dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product
    • [SPARK-8711] [ML] Add additional methods to PySpark ML tree models · 1dbc4a15
      MechCoder authored
      Adds numNodes and depth to tree models, and treeWeights to ensemble models.
      Adds __repr__ to all models.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7095 from MechCoder/missing_methods_tree and squashes the following commits:
      
      23b08be [MechCoder] private [spark]
      38a0860 [MechCoder] rename pyTreeWeights to javaTreeWeights
      6d16ad8 [MechCoder] Fix Python 3 Error
      47d7023 [MechCoder] Use np.allclose and treeEnsembleModel -> TreeEnsembleMethods
      819098c [MechCoder] [SPARK-8711] [ML] Add additional methods ot PySpark ML tree models
    • [SPARK-8570] [MLLIB] [DOCS] Improve MLlib Local Matrix Documentation. · 0a63d7ab
      Mike Dusenberry authored
      Updated MLlib Data Types Local Matrix section to include information on sparse matrices, added sparse matrix examples to the Scala and Java examples, and added Python examples for both dense and sparse matrices.
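
      For example, the dense and sparse (CSC) local matrix constructors covered by the updated docs look like this in Scala:

      ```scala
      import org.apache.spark.mllib.linalg.{Matrices, Matrix}

      // Dense 3x2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)), stored column-major:
      val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

      // Sparse 3x2 matrix with (0,0)=9.0, (2,1)=6.0, (1,1)=8.0 in CSC form:
      // column pointers, row indices, and non-zero values.
      val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))
      ```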
      
      Author: Mike Dusenberry <mwdusenb@us.ibm.com>
      
      Closes #6958 from dusenberrymw/Improve_MLlib_Local_Matrix_Documentation and squashes the following commits:
      
      ceae407 [Mike Dusenberry] Updated MLlib Data Types Local Matrix section to include information on sparse matrices, added sparse matrix examples to the Scala and Java examples, and added Python examples for both dense and sparse matrices.
    • [SPARK-8788] [ML] Add Java unit test for PCA transformer · d73bc08d
      Yanbo Liang authored
      Add Java unit test for PCA transformer
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7184 from yanboliang/spark-8788 and squashes the following commits:
      
      9d1a2af [Yanbo Liang] address comments
      b34451f [Yanbo Liang] Add Java unit test for PCA transformer
    • [SPARK-6731] [CORE] Addendum: Upgrade Apache commons-math3 to 3.4.1 · dcbd85b7
      Sean Owen authored
      (This finishes the job by removing the version overridden by Hadoop profiles.)
      
      See discussion at https://github.com/apache/spark/pull/6994#issuecomment-119113167
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7261 from srowen/SPARK-6731.2 and squashes the following commits:
      
      5a3f59e [Sean Owen] Finish updating Commons Math3 to 3.4.1 from 3.1.1
    • [HOTFIX] Rename release-profile to release · 1cb2629f
      Patrick Wendell authored
      This profile is used when publishing releases. We named it 'release-profile' because that is the Maven convention. However, it turns out this special name causes several other undesirable things to kick in when we are creating releases. For instance, it triggers the javadoc plugin to run, which actually fails in our current build set-up.

      The fix is just to rename this to a different profile so that its use has no collateral damage.
    • [SPARK-8759][SQL] add default eval to binary and unary expression according to default behavior of nullable · c46aaf47
      Wenchen Fan authored
      
      We have `nullSafeCodeGen` to provide default code generation for binary and unary expressions, and we can do the same thing for `eval` (see the sketch below).
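
      A minimal sketch of the pattern (simplified, hypothetical types; the real Expression API differs):

      ```scala
      case class Row(values: Array[Any])

      abstract class BinaryExpression(left: Row => Any, right: Row => Any) {
        // Default eval: a null child yields null, matching the default nullable
        // behavior; concrete expressions implement only the null-safe branch.
        final def eval(row: Row): Any = {
          val l = left(row)
          if (l == null) null
          else {
            val r = right(row)
            if (r == null) null else nullSafeEval(l, r)
          }
        }

        protected def nullSafeEval(l: Any, r: Any): Any
      }

      // Example: integer addition defines only the null-safe branch.
      class Add(l: Row => Any, r: Row => Any) extends BinaryExpression(l, r) {
        protected def nullSafeEval(a: Any, b: Any): Any =
          a.asInstanceOf[Int] + b.asInstanceOf[Int]
      }
      ```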
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7157 from cloud-fan/refactor and squashes the following commits:
      
      f3987c6 [Wenchen Fan] refactor Expression
  3. Jul 06, 2015
    • [SPARK-5562] [MLLIB] LDA should handle empty document. · 6718c1eb
      Alok Singh authored
      See the jira https://issues.apache.org/jira/browse/SPARK-5562
      
      Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>
      Author: Alok  Singh <singhal@aloks-mbp.usca.ibm.com>
      Author: Alok Singh <“singhal@us.ibm.com”>
      
      Closes #7064 from aloknsingh/aloknsingh_SPARK-5562 and squashes the following commits:
      
      259a0a7 [Alok Singh] change as per the comments by @jkbradley
      be48491 [Alok  Singh] [SPARK-5562][MLlib] re-order import in alphabhetical order
      c01311b [Alok  Singh] [SPARK-5562][MLlib] fix the newline typo
      b271c8a [Alok  Singh] [SPARK-5562][Mllib] As per github discussion with jkbradley. We would like to simply things.
      7c06251 [Alok  Singh] [SPARK-5562][MLlib] modified the JavaLDASuite for test passing
      c710cb6 [Alok  Singh] fix the scala code style to have space after :
      2572a08 [Alok  Singh] [SPARK-5562][MLlib] change the import xyz._ to the import xyz.{c1, c2} ..
      ab55fbf [Alok  Singh] [SPARK-5562][MLlib] Change as per Sean Owen's comments https://github.com/apache/spark/pull/7064/files#diff-9236d23975e6f5a5608ffc81dfd79146
      9f4f9ea [Alok  Singh] [SPARK-5562][MLlib] LDA should handle empty document.
    • [SPARK-6747] [SQL] Throw an AnalysisException when unsupported Java list types used in Hive UDF · 1821fc16
      Takeshi YAMAMURO authored
      The current implementation can't handle List<> as a return type in a Hive UDF and throws a meaningless MatchError. Assume a UDF like the following:

      ```
      public class UDFToListString extends UDF {
        public List<String> evaluate(Object o) {
          return Arrays.asList("xxx", "yyy", "zzz");
        }
      }
      ```

      When the UDF is used, a scala.MatchError is thrown:

      ```
      scala.MatchError: interface java.util.List (of class java.lang.Class)
        at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
        at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
        at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
        at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
        at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
        at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
        at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
        ...
      ```

      To give UDF developers a clearer error, we need to throw a more suitable exception.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #7248 from maropu/FixBugInHiveInspectors and squashes the following commits:
      
      1c3df2a [Takeshi YAMAMURO] Fix comments
      56305de [Takeshi YAMAMURO] Fix conflicts
      92ed7a6 [Takeshi YAMAMURO] Throw an exception when java list type used
      2844a8e [Takeshi YAMAMURO] Apply comments
      7114a47 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite
      fdb2ae4 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String
      af61f2e [Takeshi YAMAMURO] Remove a new type
      7f812fd [Takeshi YAMAMURO] Fix code-style errors
      6984bf4 [Takeshi YAMAMURO] Apply review comments
      93e3d4e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString
      ee232db [Takeshi YAMAMURO] Support List as a return type in Hive UDF
      1e82316 [Takeshi YAMAMURO] Apply comments
      21e8763 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite
      a488712 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String
      1c7b9d1 [Takeshi YAMAMURO] Remove a new type
      f965c34 [Takeshi YAMAMURO] Fix code-style errors
      9406416 [Takeshi YAMAMURO] Apply review comments
      e21ce7e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString
      e553f10 [Takeshi YAMAMURO] Support List as a return type in Hive UDF
    • Revert "[SPARK-8781] Fix variables in published pom.xml are not resolved" · 929dfa24
      Andrew Or authored
      This reverts commit 82cf3315.
      
      Conflicts:
      	pom.xml