  1. Jun 03, 2015
    • [SPARK-8060] Improve DataFrame Python test coverage and documentation. · ce320cb2
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:
      
      baa8ad5 [Reynold Xin] Code review feedback.
      f081d47 [Reynold Xin] More documentation updates.
      c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
    • [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust · 452eb82d
      MechCoder authored
      The current check tests whether version `1.x` is less than `1.4` by comparing the version strings. This fails when x has more than one digit: numerically x > 4, yet as strings `1.x` < `1.4`.
      
      It fails in my system since I have version `1.10` :P
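
The pitfall described above can be reproduced without NumPy. The sketch below is only an illustration, not the code from this patch: `version_tuple` and `numpy_is_supported` are hypothetical helper names, and the parsing assumes a dotted version string whose first two components start with digits.

```python
def version_tuple(version, parts=2):
    """Parse the leading numeric components of a dotted version string."""
    numbers = []
    for piece in version.split(".")[:parts]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1"
            digits += ch
        numbers.append(int(digits) if digits else 0)
    return tuple(numbers)


def numpy_is_supported(version, minimum=(1, 4)):
    """Hypothetical check: compare versions as integer tuples, not as strings."""
    return version_tuple(version) >= minimum


# Comparing strings orders "1.10" before "1.4", which is the reported bug.
print("1.10" < "1.4")                # True
print(numpy_is_supported("1.10.0"))  # True
print(numpy_is_supported("1.3.0"))   # False
```

Comparing integer tuples keeps the check independent of how many digits each component has.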
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6579 from MechCoder/np_ver and squashes the following commits:
      
      15430f8 [MechCoder] fix syntax error
      893fb7e [MechCoder] remove equal to
      e35f0d4 [MechCoder] minor
      e89376c [MechCoder] Better checking
      22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
    • [SPARK-8043] [MLLIB] [DOC] update NaiveBayes and SVM examples in doc · 43adbd56
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8043
      
      I found some issues while testing the save/load examples in the Markdown documents, as part of the 1.4 QA plan.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6584 from hhbyyh/naiveDocExample and squashes the following commits:
      
      a01a206 [Yuhao Yang] fix for Gaussian mixture
      2fb8b96 [Yuhao Yang] update NaiveBayes and SVM examples in doc
    • [MINOR] make the launcher project name consistent with others · ccaa8232
      WangTaoTheTonic authored
      I found this by chance while building Spark and think it is better to keep its name consistent with the other sub-projects (Spark Project *).
      
      I am not gonna file JIRA as it is a pretty small issue.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #6603 from WangTaoTheTonic/projName and squashes the following commits:
      
      994b3ba [WangTaoTheTonic] make the project name consistent
    • [SPARK-8053] [MLLIB] renamed scalingVector to scalingVec · 07c16cb5
      Joseph K. Bradley authored
      I searched the Spark codebase for all occurrences of "scalingVector"
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6596 from jkbradley/scalingVec-rename and squashes the following commits:
      
      d3812f8 [Joseph K. Bradley] renamed scalingVector to scalingVec
    • [SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors · cafd5056
      Josh Rosen authored
      This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.
      
      At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`.  In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods.  This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.
      
      The stricter use of types here has uncovered some bugs in other parts of Spark SQL:
      
      - #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
      - #6218: DataFrame.describe() should cast all aggregates to String
      - #6400: Use output schema, not relation schema, for data source input conversion
      
      Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema.  According to the `createDataFrame()` Scaladoc:
      
      >  It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.
      
      Given this, it sounds like it's technically not a break of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hashingTF` passes a column of integer values for a "label" column which is supposed to contain floats.  This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions.
      
      In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows.  Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch.  Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
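
As a rough illustration of the fail-fast idea described above, here is a minimal plain-Python sketch. It is not Spark code: `CONVERTERS`, `_expect`, and `to_internal_row` are hypothetical names, the type labels are made up for the example, and Python's `TypeError` stands in for the JVM's `ClassCastException`.

```python
def _expect(expected_type):
    """Build a converter that accepts only values of expected_type (or None)."""
    def convert(value):
        if value is None:
            return None
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            raise TypeError("expected %s, got %s"
                            % (expected_type.__name__, type(value).__name__))
        return value
    return convert


# One converter per declared type, mirroring the "type-specific" grouping.
CONVERTERS = {
    "int": _expect(int),
    "double": _expect(float),
    "string": _expect(str),
}


def to_internal_row(row, schema):
    """Convert a user row, failing fast when a value does not match the schema."""
    if len(row) != len(schema):
        raise ValueError("row has %d fields, schema has %d" % (len(row), len(schema)))
    return tuple(CONVERTERS[type_name](value)
                 for value, type_name in zip(row, schema))
```

With this sketch, `to_internal_row((1,), ("double",))` raises immediately, which parallels the integer "label" column case mentioned above; no implicit widening is attempted.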
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:
      
      740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
      befc613 [Josh Rosen] Add tests to document Option-handling behavior.
      5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
      6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
      3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
      6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
      677ff27 [Josh Rosen] Fix null handling bug; add tests.
      8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
      85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
      9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
      ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
      7ca7fcb [Josh Rosen] Comments and cleanup
      1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
  2. Jun 02, 2015
    • [SPARK-7547] [ML] Scala Example code for ElasticNet · a86b3e9b
      DB Tsai authored
      This is Scala example code for both linear and logistic regression. Python and Java versions are to be added.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6576 from dbtsai/elasticNetExample and squashes the following commits:
      
      e7ca406 [DB Tsai] fix test
      6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter
      136e0dd [DB Tsai] address feedback
      1ec29d4 [DB Tsai] fix style
      9462f5f [DB Tsai] add example
    • [SPARK-7387] [ML] [DOC] CrossValidator example code in Python · c3f4c325
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits:
      
      63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly
      aeb6bb6 [Ram Sriharsha] Python Style Fix
      54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      615e91c [Ram Sriharsha] cleanup
      204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python
    • [SQL] [TEST] [MINOR] Follow-up of PR #6493, use Guava API to ensure Java 6 friendliness · 5cd6a63d
      Cheng Lian authored
      This is a follow-up of PR #6493, which has been reverted in branch-1.4 because it uses Java 7 specific APIs and breaks the Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness.
      
      cc andrewor14 pwendell, this should also be backported to branch-1.4.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6547 from liancheng/override-log4j and squashes the following commits:
      
      c900cfd [Cheng Lian] Addresses Shixiong's comment
      72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness
    • [SPARK-8049] [MLLIB] drop tmp col from OneVsRest output · 89f21f66
      Xiangrui Meng authored
      The temporary column should be dropped after we get the prediction column. harsha2010
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6592 from mengxr/SPARK-8049 and squashes the following commits:
      
      1d89107 [Xiangrui Meng] use SparkFunSuite
      6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output
    • [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise() · 605ddbb2
      Davies Liu authored
      Thanks ogirardot, closes #6580
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6590 from davies/when and squashes the following commits:
      
      c0f2069 [Davies Liu] fix Column.when() and otherwise()
    • [SPARK-8014] [SQL] Avoid premature metadata discovery when writing a HadoopFsRelation with a save mode other than Append · 686a45f0
      Cheng Lian authored
      
      The current code references the schema of the DataFrame to be written before checking the save mode. This triggers expensive metadata discovery prematurely. For save modes other than `Append`, this metadata discovery is useless since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.
      
      This PR fixes this issue by deferring metadata discovery until after the save mode check.
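
The ordering described above can be sketched in plain Python with no Spark dependency. Everything here is hypothetical and only illustrative: `LazyFrame` stands in for a DataFrame whose schema is expensive to discover, the mode strings are made up for the example, and `write` only shows that the save mode is checked before the schema is ever touched.

```python
class LazyFrame:
    """Stand-in for a DataFrame whose schema is expensive to discover."""

    def __init__(self):
        self.discoveries = 0  # counts how often discovery actually ran
        self._schema = None

    @property
    def schema(self):
        if self._schema is None:
            self.discoveries += 1  # simulates costly metadata discovery
            self._schema = ("id", "name")
        return self._schema


def write(frame, mode, path_exists):
    """Check the save mode first; read frame.schema only on the append path."""
    if not path_exists:
        return "wrote new data"
    if mode == "error":
        raise ValueError("path already exists")
    if mode == "ignore":
        return "skipped"
    if mode == "overwrite":
        return "deleted existing files, then wrote"
    if mode == "append":
        # Only appending needs the schema of the data being written.
        return "appended %d columns" % len(frame.schema)
    raise ValueError("unknown save mode: %r" % (mode,))
```

Calling `write(frame, "overwrite", True)` leaves `frame.discoveries` at zero, while `write(frame, "append", True)` triggers discovery exactly once.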
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6583 from liancheng/spark-8014 and squashes the following commits:
      
      1aafabd [Cheng Lian] Updates comments
      088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
      8fbd93f [Cheng Lian] Fixes SPARK-8014
    • [SPARK-7985] [ML] [MLlib] [Docs] Remove "fittingParamMap" references. Updating ML Doc "Estimator, Transformer, and Param" examples. · ad06727f
      Mike Dusenberry authored
      
      Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.
      
      mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](https://github.com/apache/spark/pull/5820).
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:
      
      6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
      d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
      0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
      7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
    • [SPARK-8015] [FLUME] Remove Guava dependency from flume-sink. · 0071bd8d
      Marcelo Vanzin authored
      The minimal change would be to disable shading of Guava in the module,
      and rely on the transitive dependency from other libraries instead. But
      since Guava's use is so localized, I think it's better to just not use
      it instead, so I replaced that code and removed all traces of Guava from
      the module's build.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6555 from vanzin/SPARK-8015 and squashes the following commits:
      
      c0ceea8 [Marcelo Vanzin] Add comments about dependency management.
      c38228d [Marcelo Vanzin] Add guava dep in test scope.
      b7a0349 [Marcelo Vanzin] Add libthrift exclusion.
      6e0942d [Marcelo Vanzin] Add comment in pom.
      2d79260 [Marcelo Vanzin] [SPARK-8015] [flume] Remove Guava dependency from flume-sink.
    • [SPARK-8037] [SQL] Ignores files whose name starts with dot in HadoopFsRelation · 1bb5d716
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6581 from liancheng/spark-8037 and squashes the following commits:
      
      d08e97b [Cheng Lian] Ignores files whose name starts with dot in HadoopFsRelation
    • [SPARK-7432] [MLLIB] fix flaky CrossValidator doctest · bd97840d
      Xiangrui Meng authored
      The new test uses CV to compare `maxIter=0` and `maxIter=1`, and validates the evaluation result. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6572 from mengxr/SPARK-7432 and squashes the following commits:
      
      c236bb8 [Xiangrui Meng] fix flacky cv doctest
    • [SPARK-8021] [SQL] [PYSPARK] make Python read/write API consistent with Scala · 445647a1
      Davies Liu authored
      add schema()/format()/options() for reader,  add mode()/format()/options()/partitionBy() for writer
      
      cc rxin yhuai  pwendell
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6578 from davies/readwrite and squashes the following commits:
      
      720d293 [Davies Liu] address comments
      b65dfa2 [Davies Liu] Update readwriter.py
      1299ab6 [Davies Liu] make Python API consistent with Scala
    • [SPARK-8023][SQL] Add "deterministic" attribute to Expression to avoid collapsing nondeterministic projects. · 0f80990b
      Yin Huai authored
      
      This closes #6570.
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6573 from rxin/deterministic and squashes the following commits:
      
      356cd22 [Reynold Xin] Added unit test for the optimizer.
      da3fde1 [Reynold Xin] Merge pull request #6570 from yhuai/SPARK-8023
      da56200 [Yin Huai] Comments.
      e38f264 [Yin Huai] Comment.
      f9d6a73 [Yin Huai] Add a deterministic method to Expression.
    • [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early · 7b7f7b6c
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-8020
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6571 from yhuai/SPARK-8020-1 and squashes the following commits:
      
      0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
    • [SPARK-6917] [SQL] DecimalType is not read back when non-native type exists · bcb47ad7
      Davies Liu authored
      cc yhuai
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6558 from davies/decimalType and squashes the following commits:
      
      c877ca8 [Davies Liu] Update ParquetConverter.scala
      48cc57c [Davies Liu] Update ParquetConverter.scala
      b43845c [Davies Liu] add test
      3b4a94f [Davies Liu] DecimalType is not read back when non-native type exists
    • [SPARK-7582] [MLLIB] user guide for StringIndexer · 0221c7f0
      Xiangrui Meng authored
      This PR adds a Java unit test and user guide for `StringIndexer`. I put it before `OneHotEncoder` because they are closely related. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6561 from mengxr/SPARK-7582 and squashes the following commits:
      
      4bba4f1 [Xiangrui Meng] fix example
      ba1cd1b [Xiangrui Meng] fix style
      7fa18d1 [Xiangrui Meng] add user guide for StringIndexer
      136cb93 [Xiangrui Meng] add a Java unit test for StringIndexer
  3. Jun 01, 2015
  4. May 31, 2015
    • [SPARK-7952][SPARK-7984][SQL] equality check between boolean type and numeric type is broken. · a0e46a0d
      Wenchen Fan authored
      The original code has several problems:
      * `true <=> 1` will return false as we didn't set a rule to handle it.
      * `true = a`, where `a` is not a `Literal` and its value is 1, will return false as we only handle literal values.
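
The intended results for the two cases above can be sketched in plain Python. This is an illustration under stated assumptions, not the coercion rule added by the patch: it assumes 1 corresponds to true, 0 to false, and any other number equals neither, and both function names are hypothetical. `None` plays the role of SQL `NULL`.

```python
def bool_eq_numeric(flag, number):
    """SQL-style `=`: the result is None (NULL) if either side is None."""
    if flag is None or number is None:
        return None
    if number == 1:
        return flag
    if number == 0:
        return not flag
    return False  # assumption: other numbers equal neither true nor false


def bool_null_safe_eq_numeric(flag, number):
    """SQL-style `<=>`: like `=`, but never returns None."""
    if flag is None or number is None:
        return flag is None and number is None
    return bool_eq_numeric(flag, number)


print(bool_null_safe_eq_numeric(True, 1))  # True
print(bool_eq_numeric(True, 1))            # True
print(bool_eq_numeric(None, 1))            # None
```

Because the comparison works on values rather than on literal syntax, it gives the same answer whether the 1 comes from a literal or from a column.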
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6505 from cloud-fan/tmp1 and squashes the following commits:
      
      77f0f39 [Wenchen Fan] minor fix
      b6401ba [Wenchen Fan] add type coercion for CaseKeyWhen and address comments
      ebc8c61 [Wenchen Fan] use SQLTestUtils and If
      625973c [Wenchen Fan] improve
      9ba2130 [Wenchen Fan] address comments
      fc0d741 [Wenchen Fan] fix style
      2846a04 [Wenchen Fan] fix 7952
    • [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton · 91777a1c
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6532 from davies/decimal and squashes the following commits:
      
      c7fcbce [Davies Liu] Update tests.py
      1425359 [Davies Liu] DecimalType should not be singleton
    • [SPARK-7986] Split scalastyle config into 3 sections. · 6f006b5f
      Reynold Xin authored
      (1) rules that we enforce.
      (2) rules that we would like to enforce, but haven't cleaned up the codebase to
          turn on yet (or we need to make the scalastyle rule more configurable).
      (3) rules that we don't want to enforce.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6543 from rxin/scalastyle and squashes the following commits:
      
      beefaab [Reynold Xin] [SPARK-7986] Split scalastyle config into 3 sections.